Abstract. In this paper, we define a category DB, called the category of
simplicial databases, whose objects are databases and whose morphisms are 
data-preserving maps. Along the way we give a precise formulation of the 
category of relational databases, and prove that it is a full subcategory of 
DB. We also prove that limits and colimits always exist in DB and that they 
correspond to queries such as select, join, union, etc. 

One feature of our construction is that the schema of a simplicial database 
has a natural geometric structure: an underlying simplicial set. The geometry 
of a schema is a way of keeping track of relationships between distinct tables, 
and can be thought of as a system of foreign keys. The shape of a schema is 
generally intuitive (e.g. the schema for round-trip flights is a circle consisting 
of an edge from A to B and an edge from B to A), and as such, may be useful 
for analyzing data. 

We give several applications of our approach, as well as possible advantages 
it has over the relational model. We also indicate some directions for further 
research. 

1. Introduction

The theory of relational databases is generally formulated within mathemati- 
cal logic. We provide a more modern and more flexible approach using methods 
from category theory and algebraic topology. Category theory is useful both as a 
language and as a tool, and has been successfully applied to many areas of computer
science. Using an inefficient language can hamper one's ability to implement,
work with, and reason about a subject. This can be seen as one reason that SQL 
implements tables, rather than relational databases in their pure form: perhaps 
mathematical logic is not a sufficiently flexible language for discussing databases as 
they are used in practice. 

One reason that relational databases have been so successful is that their definition
can be phrased within a precise mathematical language. The definition we
provide in this paper is just as precise, if not more so (see the discussion at the
beginning of Section 4). However, we do not stop at defining the objects of study
(databases); we continue on to define morphisms between databases. With
these definitions, we have a category of databases.

There are many categories whose objects are databases (the difference being in 
their morphisms); what makes one definition better than another? First, a good 
definition should make sense - the morphisms should somehow preserve the struc- 
ture of the databases. Second, applying common categorical constructions (colimits, 
limits, etc.) to the category of databases should result in common database con- 
structions, such as unions, joins, etc. Third, the categorical approach should make 
reasoning about databases, such as that needed for maintaining and restructuring 
databases, easier. 

Our formulation accomplishes these three goals (see Remark 4.3.8 and Sections
5 and 6, respectively). As an added bonus, the schemas for our databases have
geometric structure (more precisely, the structure of a simplicial set). In other 
words, the schema is given as a geometric object which one should think of as a 
kind of Entity-Relationship diagram for the schema. This approach may lead to
improvements in query optimization because one can adjust the "shape" of the 
schema to fit with the purposes of the queries to be taken. The ability to visualize 
data should also prove useful, because these visualizations seem to "make sense" 
in practice. Examples of this phenomenon are given in 6.1.1 and 6.1.2, where
we respectively discuss round trip flights and a sociological experiment involving 
4-cycles in high school partnerships. 

The data on a given schema is given by a sheaf of sets on that schema. Sheaves 
are ubiquitous in modern mathematics because they generalize sets and functions 
and because they have good formal properties. Classical operations on sheaves 
(such as direct images) allow one to transport data from one schema to another 
in a functorial way. One of the main purposes of this paper is to provide a good 
language for discussing databases mathematically, and the consideration of data as 
a sheaf on a given schema helps to accomplish that goal. 

Other researchers have formulated databases in terms of category theory (for 
example, see [RW92], [TRW02], [PS95], [Ber07], [DK94], [Dis96], [GB92]). Of note is
work by Cadish and Diskin, and work by Rosebrugh and Wood. There are many 
differences between previous viewpoints and our own. Most notably, our work uses 
simplicial methods to give a geometric structure to the schemas of databases and 
uses sheaves over these spaces to model the data itself. Both of these approaches 
appear to be new. 

We assume throughout this paper that the reader has a basic knowledge of 
category theory which includes knowing the definition of category, functor, limit, 
and colimit, as well as basic facts such as Yoneda's lemma. Good references for this 
material include [ML98], [BW90], and [Bor94a]. We do not assume that the reader
has a prior knowledge of sheaves or of simplicial sets. 

We begin by defining the category of tables in Section 2. In Section 3 we prove
that the category of tables is closed under limits and certain colimits, and that these
constructions correspond to joins and unions. We also prove that projections and
deletions are easily defined under our formulation. In Section 4 we first give a brief
description of simplicial sets. We then proceed to define the category of simplicial
databases. In Section 5 we prove that the category of simplicial databases is
closed under all limits and colimits and prove that they again correspond to joins
and unions. Finally, in Section 6 we discuss some applications of our model and
directions for future research.

1.1. Acknowledgments. I would like to thank Paea LePendu for explaining rela- 
tional databases to me, for suggesting that databases should be categorified, and for 
his advice and encouragement throughout the process. I would also like to thank 
Chris Wilson for several useful conversations. 

2. The category of Tables 

It is no accident that SQL uses tables instead of relations: Tables are inherently 
more useful, yet just as easy to implement. They are disliked by the purists of 
relational database theory not because they are bad, but because they do not fit 
in with that theory. In this section we provide a categorical structure to the set of 
tables, thus firmly grounding it in rigorous mathematics. 

2.1. Data types. In order to define schemas, records, and tables of a given type, 
we need to define what we mean by "type." 

Definition 2.1.1. A type specification is simply a function between sets $\pi\colon U \to DT$.
The set $DT$ is called the set of data types for $\pi$, and the set $U$ is called the
domain bundle for $\pi$. Given any element $T \in DT$, the preimage $\pi^{-1}(T) \subseteq U$ is
called the domain of $T$, and an element $x \in \pi^{-1}(T)$ is called an object of type $T$.

Example 2.1.2. Let $U$ denote the disjoint union $U := (\mathbb{Z} \amalg \mathbb{R} \amalg \mathrm{Strings})$ and let
$DT$ denote the three-element set {'Z', 'R', 'Strings'}. Let $\pi\colon U \to DT$ denote the
obvious function, which sends all of $\mathbb{Z}$ to the element 'Z', all of $\mathbb{R}$ to 'R', and all of
Strings to 'Strings'. The preimage $\pi^{-1}(\text{'Strings'}) \subseteq U$, which we have called the
domain of the type 'Strings', is indeed the set of strings.

As another example, the mod 2 function $\pi\colon \mathbb{Z} \to \{\text{'even'}, \text{'odd'}\}$ is a type
specification in which the objects of type 'even' are the even integers.
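
To make the definition concrete, here is a minimal Python sketch of a type specification and of the domain of a type. It assumes a small finite universe in place of the infinite sets of Example 2.1.2, and the names `U`, `DT`, `pi`, and `domain` are illustrative choices, not notation fixed by the paper.

```python
# Sketch of Example 2.1.2 on a small finite universe (an assumption:
# the real domains Z, R, and Strings are infinite).
# Elements of the disjoint union U are tagged pairs (tag, value).
U = [("Z", 0), ("Z", 7), ("R", 2.5), ("Strings", "Obama")]
DT = {"Z", "R", "Strings"}


def pi(x):
    """The type specification pi: U -> DT sends a tagged element to its tag."""
    tag, _value = x
    return tag


def domain(T):
    """The domain of T is the preimage pi^{-1}(T), a subset of U."""
    return [x for x in U if pi(x) == T]


strings_domain = domain("Strings")
```

Tagging each element with its type is one simple way to model a disjoint union; the preimage is then just a filter on the tag.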

2.2. Schemas. We quickly recall the definition of fiber product (for sets). 

Definition 2.2.1. Let $A$, $B$, and $C$ be sets, and suppose $f\colon A \to B$ and $g\colon C \to B$
are functions with the same codomain. The fiber product of $A$ and $C$ over $B$,
denoted $A \times_B C$, is the set

$$A \times_B C := \{(a, c) \in A \times C \mid f(a) = g(c) \in B\}.$$

The fiber product moreover comes equipped with obvious projection maps making
the diagram

$$
\begin{array}{ccc}
A \times_B C & \xrightarrow{\;f'\;} & C \\
{\scriptstyle g'}\big\downarrow & \lrcorner & \big\downarrow{\scriptstyle g} \\
A & \xrightarrow{\;f\;} & B
\end{array}
$$

commute. The corner symbol $\lrcorner$ serves to remind the reader that the object in the
upper left is a fiber product. We sometimes call $g'\colon A \times_B C \to A$ the pullback of $g$
along $f$; similarly $f'$ is the pullback of $f$ along $g$.

Remark 2.2.2. The fiber product of the diagram $A \xrightarrow{f} B \xleftarrow{g} C$ above should probably
be denoted $f \times_B g$ instead of $A \times_B C$, since it depends on the maps $f$ and $g$,
not just their domains. However, this is not often done, and in this paper the maps
will be clear from context.
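
For finite sets the fiber product can be computed directly from its defining condition. The following Python sketch does so; the function name `fiber_product` and the sample sets and maps are hypothetical and chosen only for illustration.

```python
def fiber_product(A, C, f, g):
    """Return A x_B C = {(a, c) : f(a) == g(c)} for finite iterables A and C."""
    return [(a, c) for a in A for c in C if f(a) == g(c)]


A = [1, 2, 3, 4]
C = ["x", "yy", "zzz"]
f = lambda a: a % 2        # f: A -> B, with B = {0, 1}
g = lambda c: len(c) % 2   # g: C -> B

P = fiber_product(A, C, f, g)
g_prime = [a for a, _ in P]  # projection g': A x_B C -> A (pullback of g along f)
f_prime = [c for _, c in P]  # projection f': A x_B C -> C (pullback of f along g)
```

As Remark 2.2.2 points out, the result depends on the maps `f` and `g` and not only on `A` and `C`, which is why they appear as explicit arguments.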

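Definition 2.2.3. Let $\pi\colon U \to DT$ denote a type specification. A simple schema
of type $\pi$ consists of a pair $(C, \sigma)$, where $C$ is a finite (totally) ordered set and
$\sigma\colon C \to DT$ is a function. We sometimes denote the simple schema $(C, \sigma)$ by
$\sigma$. We refer to $C$ as the column set or set of attributes for $\sigma$ and $\pi$ as the type
specification for $\sigma$.

Let $U_\sigma := \sigma^{-1}(U)$ denote the fiber product $U \times_{DT} C$. We call the pullback
$\pi_\sigma\colon U_\sigma \to C$, i.e. the left hand map in the diagram

$$
\begin{array}{ccc}
U_\sigma & \longrightarrow & U \\
{\scriptstyle \pi_\sigma}\big\downarrow & & \big\downarrow{\scriptstyle \pi} \\
C & \xrightarrow{\;\sigma\;} & DT,
\end{array}
$$

the domain bundle on $C$ induced by $\sigma$.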

Remark 2.2.4. We do not worry much about the ordering on $C$, as evidenced by
the fact that we do not record it in the notation $(C, \sigma)$ for the simple schema. In
fact the ordering requirement can be dropped from the definition if one so chooses.
The reason we include it is first because the columns of a displayed table naturally
come with an order (left to right), and second because it results in a more commonly
used mathematical object down the road in Section 4. See Remark 4.1.1.

Example 2.2.5. Let $\pi\colon U \to DT$ denote the type specification of Example 2.1.2.
Let $C$ = ('First Name', 'Last Name', 'BYear'), and define $\sigma\colon C \to DT$ by

$$
\begin{aligned}
\sigma(\text{'First Name'}) &= \text{'Strings'} \\
\sigma(\text{'Last Name'}) &= \text{'Strings'} \\
\sigma(\text{'BYear'}) &= \text{'Z'}
\end{aligned}
$$

We see that $C$ is a set of attributes for the simple schema $\sigma$. We call $C$ the column
set because, once we arrange data in terms of tables, the columns of these tables
will each be headed by an element of $C$.

One can check that the domain bundle $U_\sigma \to C$ induced by $\sigma$ is the obvious
function

$$(\mathrm{Strings} \amalg \mathrm{Strings} \amalg \mathbb{Z}) \to C.$$

Thus an object of type 'First Name' is a string in this example.

Definition 2.2.6. Let $\pi\colon U \to DT$ denote a type specification. A morphism of
simple schemas (of type $\pi$), written $f\colon (C, \sigma) \to (C', \sigma')$, is an order-preserving
function $f\colon C \to C'$ such that the triangle

$$
\begin{array}{ccc}
C & \xrightarrow{\;f\;} & C' \\
 & {\scriptstyle \sigma}\searrow \quad \swarrow{\scriptstyle \sigma'} & \\
 & DT &
\end{array}
$$

commutes.

The category of simple schemas on $\pi$, denoted $\mathcal{S}^\pi$, is the category whose objects
are simple schemas and whose morphisms are morphisms thereof.

Remark 2.2.7. Let $\Delta$ denote the category of finite ordered sets. Let $(\Delta \downarrow DT)$
denote the category for which an object is a finite ordered set with a map to $DT$
and for which a morphism is an order-preserving function, over $DT$. One can easily
see that the category $\mathcal{S}^\pi$ is isomorphic to $(\Delta \downarrow DT)$, regardless of $\pi$. However, we
should think of $\pi$ as part of the data for a simple schema.

Note that the symbol $\Delta$ typically refers to the category of non-empty finite
ordered sets; one typically denotes the category of all finite ordered sets as $\Delta_+$.
For typographical reasons, we do not follow the standard convention in this paper.

2.3. Records and Tables. 

Definition 2.3.1. Let $(C, \sigma)$ be a simple schema. A record on $(C, \sigma)$ is a function
$r\colon C \to U_\sigma$ such that $\pi_\sigma \circ r = \mathrm{id}_C$, i.e. a section of the domain bundle for $\sigma$. We
denote the set of records on $\sigma$ by $\Gamma^\pi(\sigma)$, or simply by $\Gamma(\sigma)$ if $\pi$ is understood.

In other words, a record must produce, for each attribute $c \in C$, an object of
type $\sigma(c) \in DT$.

Example 2.3.2. Let $\pi$ and $(C, \sigma)$ be as in Example 2.2.5. A record on that simple
schema is a section $r$ as depicted in the diagram

$$
\begin{array}{c}
\mathrm{Strings} \amalg \mathrm{Strings} \amalg \mathbb{Z} \\
{\scriptstyle r}\big\uparrow \; \big\downarrow{\scriptstyle \pi_\sigma} \\
\{\text{'First Name'}, \text{'Last Name'}, \text{'BYear'}\}.
\end{array}
$$

That is, a record is a way to designate a first name and a last name (in Strings)
and a birth year (in $\mathbb{Z}$). For example (Barack; Obama; 1961) denotes a record on this
simple schema; that is, it defines a section of $\pi_\sigma$.

The set $\Gamma(\sigma)$ of records on $(C, \sigma)$ is simply the set of all possible such sections.
In this example $\Gamma(\sigma) = \mathrm{Strings} \times \mathrm{Strings} \times \mathbb{Z}$.
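
The description of records as sections can be sketched in Python as follows. This is only an illustration under a loud assumption: the domains are replaced by small finite lists (`DOMAIN`), so that the set of records can actually be enumerated. The names `is_record` and `records` are hypothetical.

```python
from itertools import product

# Finite stand-ins for the domains (an assumption; the paper's domains are infinite).
DOMAIN = {"Strings": ["Barack", "Michelle", "Obama"], "Z": [1961, 1964]}

C = ("First Name", "Last Name", "BYear")  # ordered column set
sigma = {"First Name": "Strings", "Last Name": "Strings", "BYear": "Z"}


def is_record(r):
    """r is a section: it assigns to each column c an object of type sigma(c)."""
    return set(r) == set(C) and all(r[c] in DOMAIN[sigma[c]] for c in C)


def records():
    """Gamma(sigma): all sections, i.e. the product of the column domains."""
    return [dict(zip(C, values))
            for values in product(*(DOMAIN[sigma[c]] for c in C))]
```

With these finite stand-ins the set of records has 3 x 3 x 2 elements, mirroring the product Strings x Strings x Z of the example.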

Definition 2.3.3. Let $\pi\colon U \to DT$ be a type specification. A table of type $\pi$
consists of a sequence $(K, C, \sigma, \tau)$, where $K$ is a set, $(C, \sigma)$ is a simple schema of
type $\pi$, and $\tau\colon K \to \Gamma(\sigma)$ is a function. We sometimes denote the table $(K, C, \sigma, \tau)$
simply by $\tau$. The set $K$ is called the set of keys of $\tau$, and $(C, \sigma)$ is called the simple
schema of $\tau$.

Remark 2.3.4. Given a table $(K, C, \sigma, \tau)$, those familiar with SQL should think
of the set $K$ of keys as the set of row identifiers for a table. These row ids are
always unique identifiers and serve as an internal key system for the table; they are
generally not considered as part of the data.

Remark 2.3.5. We do not require our tables to have finitely many rows. One could
easily enforce such a restriction if desired, and follow the rest of the paper with
that restriction in mind. The resulting category would be a full subcategory of the
one we present in Definition 2.4.1; it would still be closed under finite limits (etc.),
and queries would be taken in precisely the same way as they are here.

Example 2.3.6. Given a simple schema $(C, \sigma)$, a table on it is simply a collection
of records indexed by a set $K$. The records need not be distinct because the set $K$
keeps track of the distinctions. Continuing with $\pi$ and $(C, \sigma)$ as in Example 2.3.2,
we could have $K$ = {1, 2, 'foo'} and let $\tau\colon K \to \Gamma(\sigma)$ be the assignment

    1     |->  (Barack; Obama; 1961)
    2     |->  (Michelle; Obama; 1964)
    'foo' |->  (Barack; Obama; 1961)

This table can be written in more standard form as:

    K     || 'First Name' | 'Last Name' | 'BYear'
    ------++--------------+-------------+--------
    1     || Barack       | Obama       | 1961
    2     || Michelle     | Obama       | 1964
    'foo' || Barack       | Obama       | 1961

We indicate with the double vertical line the fact that this table corresponds to a
function whose domain is $K$.
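
Since a table is just a function out of its key set, a dictionary is a natural model for it. The sketch below encodes the table of Example 2.3.6; representing records as tuples ordered by column is an assumption made here for brevity.

```python
# A table is a function tau: K -> Gamma(sigma); here a dict from keys to records.
# Records are tuples ordered as ('First Name', 'Last Name', 'BYear').
tau = {
    1: ("Barack", "Obama", 1961),
    2: ("Michelle", "Obama", 1964),
    "foo": ("Barack", "Obama", 1961),
}

rows = len(tau)                            # number of keys, i.e. rows
distinct_records = len(set(tau.values()))  # size of the image of tau
```

The table has three rows but only two distinct records, which is exactly the situation the key set K is meant to keep track of.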

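Lemma 2.3.7. Let $\pi\colon U \to DT$ denote a type specification, let $(C_1, \sigma_1)$ and
$(C_2, \sigma_2)$ denote simple schemas on $\pi$, and let $f\colon (C_2, \sigma_2) \to (C_1, \sigma_1)$ denote a
morphism of simple schemas. There is an induced map on record sets
$f^*\colon \Gamma(\sigma_1) \to \Gamma(\sigma_2)$.

Proof. Consider the diagram

$$
\begin{array}{ccccc}
U_{\sigma_2} & \longrightarrow & U_{\sigma_1} & \longrightarrow & U \\
{\scriptstyle \pi_2}\big\downarrow & & {\scriptstyle \pi_1}\big\downarrow & & \big\downarrow{\scriptstyle \pi} \\
C_2 & \xrightarrow{\;f\;} & C_1 & \xrightarrow{\;\sigma_1\;} & DT.
\end{array}
$$

Note that the left hand square is a fiber product square. This follows by applying
basic category theory (specifically the "pasting lemma" for fiber products; see
[ML98]) to the fact that the right hand square and the big rectangle are fiber
product squares. We must show that a section $r_1\colon C_1 \to U_{\sigma_1}$ of $\pi_1$ induces a section
$r_2\colon C_2 \to U_{\sigma_2}$ of $\pi_2$, because this assignment will constitute $f^*\colon \Gamma(\sigma_1) \to \Gamma(\sigma_2)$.

Suppose given $r_1$ with $\pi_1 \circ r_1 = \mathrm{id}_{C_1}$. We have a map $r_1 \circ f\colon C_2 \to U_{\sigma_1}$ and a
map $\mathrm{id}_{C_2}\colon C_2 \to C_2$ such that $f \circ \mathrm{id}_{C_2} = f = \pi_1 \circ (r_1 \circ f)$. By the universal property,
these two maps define a map $r_2\colon C_2 \to U_{\sigma_2}$ such that, in particular, $\pi_2 \circ r_2 = \mathrm{id}_{C_2}$.
This is the desired section of $\pi_2$.
□

Given a morphism $f\colon \sigma_2 \to \sigma_1$ of simple schemas, the function $f^*\colon \Gamma(\sigma_1) \to \Gamma(\sigma_2)$
defined in the above lemma is said to be induced by $f$.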
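
Concretely, the induced map sends a record on the first schema to its precomposition with f. The Python sketch below illustrates this; it models a record as a dictionary from columns to values and a schema morphism as a dictionary from the columns of the second schema to those of the first. The name `pullback_record` and the sample data are hypothetical.

```python
def pullback_record(f, r1):
    """Given f: C2 -> C1 (a dict) and a record r1 on C1, return f^*(r1) = r1 o f."""
    return {c2: r1[c1] for c2, c1 in f.items()}


# Hypothetical schema morphism from columns ('First', 'Last') into
# ('First Name', 'Last Name', 'BYear').
f = {"First": "First Name", "Last": "Last Name"}
r1 = {"First Name": "Barack", "Last Name": "Obama", "BYear": 1961}
r2 = pullback_record(f, r1)
```

Note the direction: a map of column sets from the second schema to the first induces a map of records from the first schema to the second, which is why the functor of Remark 3.1.2 lands in the opposite category.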

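Definition 2.3.8. Let $\pi\colon U \to DT$ be a type specification, and let $(K_1, C_1, \sigma_1, \tau_1)$
and $(K_2, C_2, \sigma_2, \tau_2)$ denote tables. A morphism of tables $\varphi\colon \tau_1 \to \tau_2$ consists of
a pair $(g, f)$, where $g\colon K_1 \to K_2$ is a function and $f\colon (C_2, \sigma_2) \to (C_1, \sigma_1)$ is a
morphism of simple schema such that the diagram of sets

$$
\begin{array}{ccc}
K_1 & \xrightarrow{\;\tau_1\;} & \Gamma(\sigma_1) \\
{\scriptstyle g}\big\downarrow & & \big\downarrow{\scriptstyle f^*} \\
K_2 & \xrightarrow{\;\tau_2\;} & \Gamma(\sigma_2)
\end{array}
\tag{1}
$$

commutes, where $f^*\colon \Gamma(\sigma_1) \to \Gamma(\sigma_2)$ is the function induced by $f$.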

Example 2.3.9. Let us continue with Example 2.3.6, except for a slight renaming
of objects: $C_1 := C$, $\sigma_1 := \sigma$, $K_1 := K$, and $\tau_1 := \tau$. Let $C_2$ = {'First', 'Last'}
and let $\sigma_2$ send both elements to the data type 'Strings' $\in DT$; thus
$\Gamma(\sigma_2) = \mathrm{Strings} \times \mathrm{Strings}$.

Let $K_2$ = {5, 6, 'bar'} and $\tau_2$ be the assignment

    5     |->  (Barack; Obama)
    6     |->  (Michelle; Obama)
    'bar' |->  (George; Bush).

A morphism of tables $\varphi\colon \tau_1 \to \tau_2$ should consist of a map $g\colon K_1 \to K_2$ and a
map $f^*\colon \Gamma(\sigma_1) \to \Gamma(\sigma_2)$. We have an obvious map of simple schema $f\colon C_2 \to C_1$,
namely 'First' $\mapsto$ 'First Name' and 'Last' $\mapsto$ 'Last Name'. Then $f^*\colon \Gamma(\sigma_1) \to \Gamma(\sigma_2)$
is just the projection $\mathrm{Strings} \times \mathrm{Strings} \times \mathbb{Z} \to \mathrm{Strings} \times \mathrm{Strings}$.

Now, to define a morphism of tables $\varphi\colon \tau_1 \to \tau_2$, our choice of $g$ must send both
of the records (Barack; Obama; 1961) in $\tau_1$ to the record (Barack; Obama) and
send the record (Michelle; Obama; 1964) to the record (Michelle; Obama). There
is a unique such morphism $\varphi$ in this case.

For a variety of reasons, there does not exist a morphism of tables $\tau_2 \to \tau_1$.
For instance, such a morphism would require a morphism of simple schemas
$(C_1, \sigma_1) \to (C_2, \sigma_2)$, but no column of $C_2$ has the data type 'Z' of the column 'BYear'.
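
The commutativity condition on a morphism of tables can be checked mechanically for finite tables. The following Python sketch does this for the data of Example 2.3.9, with records modeled as dictionaries; the helper name `is_table_morphism` is hypothetical.

```python
def is_table_morphism(tau1, tau2, g, f):
    """Check that square (1) commutes: f^*(tau1(k)) == tau2(g(k)) for every key k."""
    f_star = lambda r1: {c2: r1[c1] for c2, c1 in f.items()}
    return all(f_star(tau1[k]) == tau2[g[k]] for k in tau1)


tau1 = {
    1: {"First Name": "Barack", "Last Name": "Obama", "BYear": 1961},
    2: {"First Name": "Michelle", "Last Name": "Obama", "BYear": 1964},
    "foo": {"First Name": "Barack", "Last Name": "Obama", "BYear": 1961},
}
tau2 = {
    5: {"First": "Barack", "Last": "Obama"},
    6: {"First": "Michelle", "Last": "Obama"},
    "bar": {"First": "George", "Last": "Bush"},
}
f = {"First": "First Name", "Last": "Last Name"}  # f: C2 -> C1
g = {1: 5, 2: 6, "foo": 5}                        # g: K1 -> K2
bad_g = {1: 5, 2: 6, "foo": "bar"}                # does not preserve the data

ok = is_table_morphism(tau1, tau2, g, f)
not_ok = is_table_morphism(tau1, tau2, bad_g, f)
```

Here `g` is the unique key map making the square commute, while `bad_g` sends a Barack Obama row to the George Bush row and so fails the check.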



Remark 2.3.10. The morphism of tables in Example 2.3.9 has a common form.
As in the example, a morphism of tables often is composed of a projection (in the
columns) together with an inclusion (in the rows). The requirement that the square
(1) in Definition 2.3.8 commutes is simply the requirement that morphisms preserve
the integrity of the data.

2.4. The category of tables. We have now defined tables and morphisms between 
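tables. Given morphisms depicted

$$
\begin{array}{ccc}
K_1 & \xrightarrow{\;\tau_1\;} & \Gamma(\sigma_1) \\
\big\downarrow & & \big\downarrow \\
K_2 & \xrightarrow{\;\tau_2\;} & \Gamma(\sigma_2) \\
\big\downarrow & & \big\downarrow \\
K_3 & \xrightarrow{\;\tau_3\;} & \Gamma(\sigma_3)
\end{array}
$$

it is easy to see how composition is defined. It is also easy to understand the identity
morphism on a table $\tau\colon K \to \Gamma(\sigma)$. Thus we have a category.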



Definition 2.4.1. Let $\pi\colon U \to DT$ denote a type specification. The category
whose objects are tables $K \to \Gamma(\sigma)$ and whose morphisms are commutative squares
as in Definition 2.3.8 is called the category of tables on $\pi$ and is denoted $\mathbf{Tables}^\pi$,
or simply Tables, if $\pi$ is understood.

Example 2.4.2. Suppose $\pi\colon U \to DT$ is as in Example 2.2.5. Suppose that $C = \{c_1, c_2\}$
and $C' = \{c'\}$, and that $\sigma\colon C \to DT$ and $\sigma'\colon C' \to DT$ are the unique
maps such that $\Gamma(\sigma) = \mathbb{Z} \times \mathbb{Z}$ and $\Gamma(\sigma') = \mathbb{Z}$. Let $K$ and $K'$ be any two sets and
$\tau\colon K \to \Gamma(\sigma)$ and $\tau'\colon K' \to \Gamma(\sigma')$ be any two tables.

For a morphism $\tau \to \tau'$ in the category of tables, we are allowed any kind of
function between key sets $K \to K'$, but the only permitted maps $\mathbb{Z} \times \mathbb{Z} \to \mathbb{Z}$
are the two projections, because they are the only maps which are induced by
morphisms of simple schema.

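Definition 2.4.3. Let $\pi\colon U \to DT$ denote a type specification and let $\sigma\colon C \to DT$
denote a simple schema. The category of tables on $\sigma$ of type $\pi$, denoted $\mathbf{Tables}^\pi_\sigma$,
is the category whose objects are tables $\tau\colon K \to \Gamma(\sigma)$ and whose morphisms are
triangles

$$
\begin{array}{ccc}
K & \xrightarrow{\;g\;} & K' \\
 & {\scriptstyle \tau}\searrow \quad \swarrow{\scriptstyle \tau'} & \\
 & \Gamma(\sigma) &
\end{array}
$$

denoted by $g\colon \tau \to \tau'$.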

2.5. Relational tables. The most common formulation of databases used today
is the relational model, invented by E.F. Codd (see [Cod70]). It is based on the
theory of mathematical logic, and more specifically on relations. One can find a
modern treatment of the subject in [Dat05]. We define a relation in Definition 2.5.1
as a type of table, where the map $\tau\colon K \to \Gamma(\sigma)$ is required to be an injection.

Definition 2.5.1. Let $\pi\colon U \to DT$ denote a type specification, and let $\sigma\colon C \to DT$
denote a simple schema on $\pi$. A relation on $\sigma$ is a table $\tau\colon K \to \Gamma(\sigma)$ for which $\tau$
is an injective function.

A morphism of relations is a morphism of tables, for which the source and target
tables are relations. That is, the category of relations, denoted $\mathbf{Rel}^\pi$, is the full
subcategory of $\mathbf{Tables}^\pi$ spanned by the relations. Similarly, given a simple schema
$\sigma$, the category of relations on $\sigma$ is the full subcategory $\mathbf{Rel}^\pi_\sigma$ of $\mathbf{Tables}^\pi_\sigma$ spanned by the
relations. As usual the superscript $\pi$ can be dropped if it is understood.

There is a functor $\mathbf{Rel} \to \mathbf{Tables}$ and a functor $\mathbf{Rel}_\sigma \to \mathbf{Tables}_\sigma$, both of which
are simply inclusions of full subcategories.
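
For a finite table, the injectivity condition defining a relation is easy to test. The sketch below assumes records are hashable tuples; the name `is_relation` is hypothetical.

```python
def is_relation(tau):
    """A table tau: K -> Gamma(sigma) is a relation when tau is injective."""
    records = list(tau.values())
    return len(set(records)) == len(records)


# The table of Example 2.3.6 repeats a record, so it is not a relation.
table = {
    1: ("Barack", "Obama", 1961),
    2: ("Michelle", "Obama", 1964),
    "foo": ("Barack", "Obama", 1961),
}
# Dropping the duplicate row yields a relation.
relation = {k: r for k, r in table.items() if k != "foo"}
```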

3. Constructions and formal properties of Tables 



Our definition for the category of tables (Definition 2.4.1) is sensible because
objects are tables and morphisms are data-preserving maps. In this section we show
that category-theoretic operations on tables correspond to operations on databases,
such as joins and other queries. Fix a type specification $\pi\colon U \to DT$ for the
remainder of the section. We will drop $\pi$ as a superscript in this section; for
example the category $\mathcal{S}^\pi$ of simple schema on $\pi$ will be denoted simply by $\mathcal{S}$.

We sometimes refer to the underlying keys or underlying simple schema of a
table, so we record these trivial constructions in a remark.

Remark 3.1.2. There is a forgetful functor $\mathbf{Tables} \to \mathbf{Sets}$ given by sending a table
$\tau\colon K \to \Gamma(\sigma)$ to the key set $K$ and a morphism of tables to the underlying map
of keys. There is another forgetful functor $\mathbf{Tables} \to \mathcal{S}^{op}$ which sends the table
$\tau$ to its simple schema $\sigma$ and a morphism $\varphi = (g, f)$ of tables to the underlying
morphism of simple schema $f$.

Lemma 3.1.3. There exists a final object and an initial object in Tables.

Proof. One checks immediately that if we take $K$ to be a terminal object in Sets
(i.e. any set $K$ with cardinality 1) and $\sigma$ to be the initial object $\emptyset \to DT$ in $\mathcal{S}$, then
there is exactly one table with these as its underlying keys and simple schema, and
this table is the terminal object in Tables.

One also checks immediately that if we take $K = \emptyset$ to be the initial object in
Sets and $\sigma = \mathrm{id}_{DT}\colon DT \to DT$ to be the final object in $\mathcal{S}$, then there is exactly
one table with these as its underlying keys and simple schema, and this table is the
initial object in Tables.
□



Certain colimits exist in Tables; namely colimits of diagrams that are constant 
in the underlying simple schema. 

Construction 3.1.4. Let $\tau_1\colon K_1 \to \Gamma(\sigma)$ and $\tau_2\colon K_2 \to \Gamma(\sigma)$ be two tables with the
same simple schema. By taking the disjoint union of $K_1$ and $K_2$ we get a new table
$\tau\colon K_1 \amalg K_2 \to \Gamma(\sigma)$. This query is called UNION ALL in SQL.

We can also take the (non-disjoint) union of these two tables, if we know how they
overlap. That is, if there is some set $K$ with maps $g_1\colon K \to K_1$ and $g_2\colon K \to K_2$ in
such a way that $\tau_1 \circ g_1 = \tau_2 \circ g_2$, then we can obtain a new table $\tau\colon K_1 \amalg_K K_2 \to \Gamma(\sigma)$.
This query is called UNION in SQL.
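
Both constructions can be sketched for finite tables in Python. In this illustration (all names hypothetical) the disjoint union is modeled by tagging keys, and the pushout of key sets is computed by gluing keys with a small union-find; the overlap is supplied explicitly as the maps `g1` and `g2`, as in the construction.

```python
def union_all(tau1, tau2):
    """Coproduct: keys are tagged so that K1 and K2 stay disjoint."""
    out = {(1, k): r for k, r in tau1.items()}
    out.update({(2, k): r for k, r in tau2.items()})
    return out


def union_over(tau1, tau2, g1, g2):
    """Pushout of key sets over K: glue (1, g1[k]) to (2, g2[k]) for each k in K."""
    table = union_all(tau1, tau2)
    parent = {key: key for key in table}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for k in g1:
        assert tau1[g1[k]] == tau2[g2[k]], "tau1 o g1 must equal tau2 o g2"
        a, b = find((1, g1[k])), find((2, g2[k]))
        if a != b:
            parent[b] = a
    # One row per equivalence class of keys, keyed by the class representative.
    return {find(key): table[key] for key in table}


tau1 = {1: ("Barack", "Obama"), 2: ("Michelle", "Obama")}
tau2 = {"a": ("Barack", "Obama"), "b": ("George", "Bush")}
g1, g2 = {0: 1}, {0: "a"}  # K = {0} records that row 1 and row 'a' coincide

all_rows = union_all(tau1, tau2)
glued = union_over(tau1, tau2, g1, g2)
```

The glued table is well defined because identified keys carry equal records, which is precisely the compatibility condition on the two composites.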

We will see that limits in the category of tables correspond to generalized joins. 

Proposition 3.1.5. All finite limits exist in Tables. 

Proof. It suffices (see, for example, [MLM94, p. 30]) to show that Tables has a 
terminal object and is closed under taking fiber products; the first of these facts 
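was shown in Lemma 3.1.3. For the second, suppose we have a diagram

$$
\begin{array}{ccc}
K_1 & \xrightarrow{\;\tau_1\;} & \Gamma(\sigma_1) \\
\big\downarrow & & \big\downarrow{\scriptstyle f_1^*} \\
K & \xrightarrow{\;\tau\;} & \Gamma(\sigma) \\
\big\uparrow & & \big\uparrow{\scriptstyle f_2^*} \\
K_2 & \xrightarrow{\;\tau_2\;} & \Gamma(\sigma_2)
\end{array}
\tag{2}
$$

in Tables, where $\sigma\colon C \to DT$ and $\sigma_i\colon C_i \to DT$ for $i = 1, 2$ are simple schemas.
As indicated, the maps $\Gamma(\sigma_i) \to \Gamma(\sigma)$ are induced by morphisms of simple schema
$f_i\colon \sigma \to \sigma_i$, for $i = 1, 2$.

Consider the simple schema

$$(\sigma_1 \amalg_\sigma \sigma_2)\colon C_1 \amalg_C C_2 \to DT$$

induced by taking the colimit of the column sets. We would like to show that the
natural function

$$\Gamma(\sigma_1 \amalg_\sigma \sigma_2) \to \Gamma(\sigma_1) \times_{\Gamma(\sigma)} \Gamma(\sigma_2) \tag{3}$$

is a bijection.

Let us first calculate the set $\Gamma(\sigma_1 \amalg_\sigma \sigma_2)$. It is the set of all sections $r$ of the map
$\pi'$ in the diagram

$$
\begin{array}{ccc}
(\sigma_1 \amalg_\sigma \sigma_2)^{-1}(U) & \longrightarrow & U \\
{\scriptstyle r}\big\uparrow \; \big\downarrow{\scriptstyle \pi'} & & \big\downarrow{\scriptstyle \pi} \\
C_1 \amalg_C C_2 & \longrightarrow & DT.
\end{array}
$$

To give such a section is to give, for each $c_1 \in C_1$ an element of $\pi^{-1}(\sigma_1(c_1))$, and
for each $c_2 \in C_2$ an element of $\pi^{-1}(\sigma_2(c_2))$, in such a way that for all $c \in C$, the
induced elements in $\pi^{-1}(\sigma_i(f_i(c)))$ are the same for $i = 1, 2$. This is precisely the
data needed for a unique element of the set $\Gamma(\sigma_1) \times_{\Gamma(\sigma)} \Gamma(\sigma_2)$; this proves the claim
that the map in (3) is a bijection.

It now follows that the fiber product of Diagram (2) is the table

$$\tau_1 \times_\tau \tau_2\colon K_1 \times_K K_2 \to \Gamma(\sigma_1 \amalg_\sigma \sigma_2)$$

obtained by taking the fiber product of sources and targets in (2), and the induced
map between them.
□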



Proposition 3.1.5 gives the formula for the join of two tables over a third. As one 
sees from the construction, the columns of the join are the union of the columns of 
the given tables, and the key set is the fiber product of the key sets of the given 
tables. 
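
As an illustration of this formula, the Python sketch below computes a join in a common special case, assumed here for simplicity: the two tables are joined over the terminal table on a set of shared columns, and the schema maps are inclusions of those shared column names. The function name `join` and the sample data are hypothetical.

```python
def join(tau1, tau2, shared):
    """Fiber product of tau1 and tau2 over the terminal table on the shared columns.

    Keys: pairs (k1, k2) whose records agree on `shared`.
    Columns: the union of the columns of the two tables.
    Assumes the only column names the tables have in common are those in `shared`.
    """
    restrict = lambda r: tuple(r[c] for c in shared)
    return {
        (k1, k2): {**r1, **r2}
        for k1, r1 in tau1.items()
        for k2, r2 in tau2.items()
        if restrict(r1) == restrict(r2)
    }


# Hypothetical sample data.
people = {
    1: {"First": "Barack", "Last": "Obama"},
    2: {"First": "Michelle", "Last": "Obama"},
    3: {"First": "George", "Last": "Bush"},
}
homes = {
    "a": {"Last": "Obama", "City": "Chicago"},
    "b": {"Last": "Bush", "City": "Crawford"},
}
joined = join(people, homes, ["Last"])
```

The key set of the result is a fiber product of the two key sets and each record lives on the union of the two column sets, matching the description above.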

Lemma 3.1.6. Let $\sigma\colon C \to DT$ denote a simple schema. The category $\mathbf{Tables}_\sigma$
of tables on $\sigma$ is closed under small limits and colimits.

Proof. The category of sets is closed under small limits and colimits. To take the
limit or colimit of a diagram $X\colon I \to \mathbf{Tables}_\sigma$, simply take the limit or colimit
(respectively) of the underlying diagram of key sets (see Remark 3.1.2). This
set comes with a natural map to $\Gamma(\sigma)$, and one shows easily that it is the limit or
colimit (respectively) of $X$.
□

Example 3.1.7. Let $\sigma\colon C \to DT$ denote a simple schema. The initial and final
objects in $\mathbf{Tables}_\sigma$ are $\emptyset \to \Gamma(\sigma)$ and $\mathrm{id}_{\Gamma(\sigma)}\colon \Gamma(\sigma) \to \Gamma(\sigma)$, respectively.

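Construction 3.1.8. Let $\tau\colon K \to \Gamma(\sigma)$ be a table with simple schema $\sigma\colon C \to DT$,
and let $C' \subseteq C$ be a subset of its column set. There is an induced table

$$\tau|_{C'}\colon K \to \Gamma(\sigma|_{C'}),$$

obtained by composing $\tau$ with the map $\Gamma(\sigma) \to \Gamma(\sigma|_{C'})$ induced by the inclusion
$C' \subseteq C$ (Lemma 2.3.7).

In SQL this construction is called the projection of $\tau$ onto the subset $C' \subseteq C$ of
columns.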
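
In code, projection restricts every record to the chosen columns and leaves the key set alone. A minimal Python sketch, with the hypothetical name `project` and records modeled as dictionaries:

```python
def project(tau, columns):
    """Restrict every record to `columns`; the key set K is unchanged."""
    return {k: {c: r[c] for c in columns} for k, r in tau.items()}


tau = {
    1: {"First Name": "Barack", "Last Name": "Obama", "BYear": 1961},
    2: {"First Name": "Michelle", "Last Name": "Obama", "BYear": 1964},
    "foo": {"First Name": "Barack", "Last Name": "Obama", "BYear": 1961},
}
names = project(tau, ["First Name", "Last Name"])
```

Because keys are kept, the projected table still has one row per original key, even when the projected records coincide.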

Using the projection query, one can realize a SELECT query as a limit of 
databases. 

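Construction 3.1.9. Let us construct the SELECT query. One begins with a table
$\tau\colon K \to \Gamma(\sigma)$ with simple schema $\sigma\colon C \to DT$, from which to select. Let $f\colon C' \subseteq C$
be a subset of its columns, and let $\sigma' = \sigma|_{C'}\colon C' \to DT$ be the restricted simple
schema. One may select from $\tau$ all records whose restriction to $C'$ is a member of
some list. We encode this list as a table $\tau'\colon K' \to \Gamma(\sigma')$ on $\sigma'$.

In order to select from $\tau$ all records whose restriction to $C'$ is in the table $\tau'$,
take the limit of the diagram

$$
\begin{array}{ccc}
K & \xrightarrow{\;\tau\;} & \Gamma(\sigma) \\
{\scriptstyle f^* \circ \tau}\big\downarrow & & \big\downarrow{\scriptstyle f^*} \\
\Gamma(\sigma') & \xrightarrow{\;\mathrm{id}\;} & \Gamma(\sigma') \\
{\scriptstyle \tau'}\big\uparrow & & \big\uparrow{\scriptstyle \mathrm{id}} \\
K' & \xrightarrow{\;\tau'\;} & \Gamma(\sigma').
\end{array}
$$

This limit is the desired SELECT query.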



Example 3.1.10. Let $\tau\colon K \to \Gamma(\sigma)$ be the table from Example 2.3.6. To select all
instances for which the first name is Barack, let $C'$ = {'First Name'}. Let $\tau'$ denote
the one-row table

    K'    || 'First Name'
    ------++-------------
    k'    || Barack

Both $\tau$ and $\tau'$ have a canonical map to the terminal table on $C'$, the table with one
column ('First Name') and with a row for each element of Strings. Of course, this
terminal table is too big to write down, but we do not need it. The fiber product
is easily computed to be the table

    K     || 'First Name' | 'Last Name' | 'BYear'
    ------++--------------+-------------+--------
    1     || Barack       | Obama       | 1961
    'foo' || Barack       | Obama       | 1961
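
The same computation can be sketched in Python. This is an illustration only: the terminal table is never materialized, and the fiber product is instead computed by comparing restrictions of records directly. The function name `select` is hypothetical.

```python
def select(tau, tau_prime, columns):
    """Fiber product of tau and tau' over the terminal table on `columns`."""
    restrict = lambda r: tuple(r[c] for c in columns)
    return {
        (k, k_prime): r
        for k, r in tau.items()
        for k_prime, r_prime in tau_prime.items()
        if restrict(r) == restrict(r_prime)
    }


tau = {
    1: {"First Name": "Barack", "Last Name": "Obama", "BYear": 1961},
    2: {"First Name": "Michelle", "Last Name": "Obama", "BYear": 1964},
    "foo": {"First Name": "Barack", "Last Name": "Obama", "BYear": 1961},
}
tau_prime = {"k'": {"First Name": "Barack"}}

selected = select(tau, tau_prime, ["First Name"])
selected_keys = {k for k, _ in selected}
```

Keys of the result are pairs, as in any fiber product; since the list table has a single key, they correspond to the keys 1 and 'foo' of the table displayed above.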



We conclude this section by a quick remark on the category-theoretic properties 
of the relational tables. 

Remark 3.1.11. Relations behave much like ordinary tables. Limits exist in $\mathbf{Rel}$
and $\mathbf{Rel}_\sigma$. The functor $\mathbf{Rel} \to \mathbf{Tables}$ preserves limits, and the functor
$\mathbf{Rel}_\sigma \to \mathbf{Tables}_\sigma$ preserves limits but does not preserve colimits.

We take the viewpoint that the "correct" way to take a colimit of a diagram
$X\colon I \to \mathbf{Rel}_\sigma$ is to pass to the diagram $I \to \mathbf{Tables}_\sigma$ and take its colimit instead.
This claim, in particular, says that sometimes UNION ALL is more appropriate
than UNION is. Since UNION ALL is not legal in the strict relational database
theory (or it would be the same as UNION), our viewpoint could be seen as
controversial to purists of the relational model.

4. SCHEMAS AND DATABASES 

A relational database is a set of relations, together with a system of keys and 
foreign keys which link the relations together. The definition of relations themselves 
is, of course, quite mathematically precise. However, the precise way in which these 
relations are allowed to be linked together is rarely written down as a mathematical 
structure in its own right, either in research papers or textbooks (we could not find 
it in [Dat05] or [EN07], for example). For example, ER diagrams are exemplified
or even defined, but not as a mathematical object (like relations are). There are
exceptions, such as [RW92, 2.1], but as far as we know, these definitions are not
actually the ones used, either by practitioners or by theorists.

In this section we will define simplicial databases in a rigorous way (see Definition
4.3.3). Although examples will be plentiful, they will never stand in for precise
definitions. We will also define morphisms of databases, thus making explicit the
idea of "data-preserving maps." Providing a precise definition of the category of
databases may be useful to database theorists, as well as to people interested in
studying mathematical informatics.

4.1. Schemas. Roughly, a simplicial set is a picture that can be drawn with vertices,
edges, solid triangles, solid tetrahedra, and solid "higher-dimensional tetrahedra."
For any integer $n \geq 0$, an n-dimensional solid tetrahedron, or n-simplex, is
the "diagonal triangle" shape in $\mathbb{R}^{n+1}$ given by the algebraic equation
$x_1 + x_2 + \cdots + x_{n+1} = 1$ and the inequalities $x_i \geq 0$ for $1 \leq i \leq n + 1$. To draw with these
shapes is to connect various tetrahedra together along their faces (or subfaces). For
example, one could connect four triangles together along various faces to obtain an
empty tetrahedron, the boundary of the 3-simplex.

Simplicial sets are a fundamental tool in algebraic topology, and are important in
many other fields within mathematics, such as combinatorial commutative algebra.
See [Fri08] or [GJ99] for details.

A database is a system of tables which are connected together via foreign keys. 
This information is part of the schema for the database. In our formulation, we keep 
track of this information using (something akin to) simplicial sets as our schema. 
Tables are connected together when the corresponding simplices are connected. 



We use a slight variant of simplicial sets, which we will define in Definition 4.1.2.
Namely, since columns can only take entries in a given data type, we must keep 
track of this information. For this reason, the simplicial sets we use as schema have 
labeled vertices, where each label is an element of DT. We do not define schemas 
exactly this way, however, because a more generalizable way to phrase it may be 
useful for future generalizations. 

Remark 4.1.1. As mentioned in Remark 2.2.4, some prefer the columns of each
table in a database to be unordered, whereas we have chosen to consider them as
an ordered set. Simply using symmetric simplicial sets, a variant of simplicial sets
in which vertices are unordered, will solve any such issue. See [Gra01] for details
on symmetric simplicial sets.

Definition 4.1.2. Let $\Delta$ denote the category of finite ordered sets, let $\pi\colon U \to DT$
be a type specification, and let

$$\mathcal{S} = (\Delta \downarrow DT)$$

denote the category of simple schema on $\pi$ (see Definition 2.2.3 and Remark 2.2.7).
We define the category of schema on $\pi$, denoted $\mathbf{Sch}^\pi$, to be the category whose
objects are functors $X\colon \mathcal{S}^{op} \to \mathbf{Sets}$ and whose morphisms are natural
transformations of functors.

Let $X \in \mathbf{Sch}^\pi$ denote a schema. Given a simple schema $\sigma\colon C \to DT$, the
$\sigma$-simplices of $X$ are the elements of the set $X(\sigma)$, and we write $X_\sigma$ to denote
$X(\sigma)$.

Remark 4.1.3. Given a category $\mathcal{C}$, the category whose objects are functors $\mathcal{C}^{op} \to \mathbf{Sets}$
and whose morphisms are natural transformations of functors is called the
category of presheaves on $\mathcal{C}$ and denoted $\mathrm{Pre}(\mathcal{C})$. It is a common mathematical
construction which "formally adds all colimits to $\mathcal{C}$." That is, $\mathrm{Pre}(\mathcal{C})$ is closed
under taking colimits, and for any functor $\mathcal{C} \to \mathcal{D}$ to a category $\mathcal{D}$ which is closed
under taking colimits, there is a unique colimit-preserving functor $\mathrm{Pre}(\mathcal{C}) \to \mathcal{D}$
over $\mathcal{C}$. See, for example, [MLM94, 1.5.4].

Thus, we have $\mathbf{Sch}^\pi = \mathrm{Pre}(\mathcal{S}^\pi)$. Since $\mathcal{S}^\pi$ signifies the category of ways to
set up the columns of a table, $\mathrm{Pre}(\mathcal{S}^\pi)$ is the category of ways to glue such things
together.

Remark 4.1.4. The category of (augmented) simplicial sets is the category $\mathrm{Pre}(\Delta)$.
The only difference between it and $\mathrm{Pre}(\mathcal{S}^\pi) = \mathrm{Pre}(\Delta \downarrow \mathbf{DT})$ is that each simplex in
$\mathbf{Sch}^\pi$ has labeled vertices, whereas simplices in $\mathrm{Pre}(\Delta)$ do not. In the introduction
to this section we described simplicial sets in terms of tetrahedra. After making
the necessary modifications, we see that a schema is constructed by gluing together
labeled tetrahedra along their faces, where we only allow these tetrahedra to be
glued if their labels match.

If $X$ is a schema, we sometimes refer to the simplices of its underlying simplicial
set as simplices of $X$. Thus, the set of $n$-simplices of $X$ is the disjoint union of the
sets of $\sigma$-simplices of $X$, where $\sigma\colon C \to \mathbf{DT}$ ranges over the simple schema with
cardinality $\mathrm{card}(C) = n + 1$. That is, we write

$$X_n = \coprod_{\{\sigma\colon C \to \mathbf{DT} \,\mid\, \mathrm{card}(C) = n+1\}} X_\sigma.$$

There is a classifying map $s\colon X_0 = \coprod_{a \in \mathbf{DT}} X_a \to \mathbf{DT}$ which sends all of $X_a$ to $a$,
for each $a \in \mathbf{DT}$.

One of the best features of the schema we are presenting here is their geometric
nature, as described in the first paragraph of this section. Unfortunately, Definition
4.1.2 does not make the geometry explicit at all. Hopefully the next few examples
will help make it clearer.

Example 4.1.5. Let $\sigma\colon C \to \mathbf{DT}$ denote a simple schema. It naturally defines a
schema $X = \Delta^\sigma$ as the functor which sends a simple schema $\sigma'\colon C' \to \mathbf{DT}$ to
the set $X_{\sigma'} = \mathrm{Hom}(\sigma', \sigma)$. If $C$ has $n + 1$ elements, one visualizes $\Delta^\sigma$ as an
$n$-dimensional tetrahedron whose vertices are labeled by elements in the image of
$\sigma$.

This is not just a heuristic: there is a geometric realization functor $\mathrm{Re}\colon \mathbf{Sch} \to \mathbf{Top}$
which realizes every schema as a topological space in a natural way, and
behaves as we have described for simplices $\Delta^\sigma$.

As an example, suppose $C$ has two elements and their images under $\sigma$ are $a, b \in \mathbf{DT}$.
We imagine $\Delta^\sigma$ as a line segment, whose vertices are labeled $a$ and $b$. If $C'$
has three elements and $\sigma'\colon C' \to \mathbf{DT}$ sends two of them to $a$ and one of them to $b$, we imagine
$\Delta^{\sigma'}$ as a filled-in triangle, whose vertices are labeled $a$, $a$, and $b$. The figures we
have imagined are the images of $\Delta^\sigma$ and $\Delta^{\sigma'}$ under $\mathrm{Re}$.
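The set $\mathrm{Hom}(\sigma', \sigma)$ can be enumerated by brute force for small schema. The following Python sketch is illustrative only and rests on two assumptions that are not spelled out in this section: a simple schema is modelled as a tuple of labels (the tuple position stands in for the ordered column set), and a morphism is a weakly monotone map of positions that preserves labels. The names `hom`, `edge` and `triangle` are hypothetical.

```python
from itertools import product


def hom(sigma_prime, sigma):
    """List the morphisms sigma' -> sigma between two simple schema.

    A simple schema is modelled as a tuple of data-type labels; position i
    is the i-th column. A morphism is a weakly monotone map of positions
    g with sigma[g[i]] == sigma_prime[i] for every i (labels preserved).
    """
    n_src, n_tgt = len(sigma_prime), len(sigma)
    maps = []
    for g in product(range(n_tgt), repeat=n_src):
        monotone = all(g[i] <= g[i + 1] for i in range(n_src - 1))
        labels_ok = all(sigma[g[i]] == sigma_prime[i] for i in range(n_src))
        if monotone and labels_ok:
            maps.append(g)
    return maps


edge = ("a", "b")              # the line segment with vertices a and b
triangle = ("a", "a", "b")     # the filled-in triangle labeled a, a, b

# Vertices of the edge, grouped by label: one a-vertex, one b-vertex, no c.
vertices_of_edge = {lab: hom((lab,), edge) for lab in ("a", "b", "c")}

# Edges of the triangle that run from an a-vertex to the b-vertex.
edges_of_triangle = hom(("a", "b"), triangle)
```

Under these assumptions the edge has exactly one vertex of each of its two labels, and the triangle has two edges labeled $(a, b)$, one from each of its $a$-vertices.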

Definition 4.1.6. Let $\sigma \in \mathcal{S}^\pi$ denote a simple schema. The schema $\Delta^\sigma \in \mathbf{Sch}^\pi$
defined in Example 4.1.5 is called the $\sigma$-simplex and, as a functor $(\mathcal{S}^\pi)^{op} \to \mathbf{Sets}$, is
said to be represented by $\sigma$.

Example 4.1.7. We have mentioned that every object in $\mathbf{Sch}^\pi$ can be obtained by
gluing together simplices. This is proven in [Bor94a, 2.15.6]. Let us explain how we
would construct the union $X$ of two edges along a common vertex. Suppose that
the common vertex is labeled $b$ and the other vertices are labeled $a$ and $c$. The
schema $X$ is obtained as the colimit of the diagram

$$\Delta^{(a,b)} \longleftarrow \Delta^{(b)} \longrightarrow \Delta^{(b,c)}$$

taken in $\mathbf{Sch}^\pi$.

We will now write down this schema explicitly as a presheaf on $\mathcal{S}^\pi$, i.e. as a
functor $X\colon (\Delta \downarrow \mathbf{DT})^{op} \to \mathbf{Sets}$. Given $\sigma\colon C \to \mathbf{DT}$, we let $X_\sigma$ be a single
element if the image of $\sigma$ is contained in $\{a, b\}$ or contained in $\{b, c\}$. Otherwise we
take $X_\sigma$ to be the empty set.
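The rule just stated is simple enough to write as a function. This Python sketch only transcribes that rule; the function name `simplices_of_X`, the encoding of a simple schema as a tuple of labels, and the use of `"*"` for the single element are all illustrative assumptions.

```python
def simplices_of_X(sigma):
    """Value X_sigma of the glued schema on a simple schema sigma.

    sigma is a tuple of labels. Following the rule stated in the text, the
    result is a one-element set when all labels lie in {a, b} or all lie
    in {b, c}, and the empty set otherwise.
    """
    image = set(sigma)
    if image <= {"a", "b"} or image <= {"b", "c"}:
        return {"*"}
    return set()
```

For instance, `simplices_of_X(("a", "b"))` is a one-element set, while `simplices_of_X(("a", "c"))` is empty because no simplex of $X$ joins $a$ to $c$.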

Example 4.1.8. A basic example of a schema is that of a set of labeled vertices with
no edges or higher simplices connecting them. This is obtained as a coproduct of
0-simplices (see Remark 4.1.3), and it is called a discrete schema.



4.2. Sheaves on a schema. 

Definition 4.2.1. Let $X \in \mathbf{Sch}^\pi$ denote a schema. A subschema of $X$ consists of a
schema $X' \in \mathbf{Sch}^\pi$ such that for every $\sigma \in \mathcal{S}^\pi$ we have $X'_\sigma \subseteq X_\sigma$. The subschemas
of $X$ form a category $\mathrm{Sub}(X)$, in which there is a morphism $X'' \to X'$ in $\mathrm{Sub}(X)$
if and only if $X''$ is a subschema of $X'$.

We will soon be discussing colimits in the category $\mathrm{Sub}(X)$. One should note
that $\mathrm{Sub}(X)$ is particularly nice, in that the colimit of any diagram $D\colon I \to \mathrm{Sub}(X)$
is the smallest subschema $X' \subseteq X$ which contains $D(i)$ for all $i \in I$. In
the language of lattices or locales, one writes $\mathrm{colim}(D) = \bigcup_{i \in I} D(i)$. See [Bor94b,
1.3].
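A small Python sketch may make the "colimit equals union" statement concrete. It assumes, for illustration only, that a subschema is recorded by its set of non-degenerate simplices, each written as a tuple of distinct vertex names; the helper names `subschema` and `colimit` are hypothetical.

```python
from itertools import combinations


def subschema(*top_simplices):
    """Face-closed set of non-degenerate simplices generated by the arguments.

    A simplex is a tuple of distinct vertex names listed in order.
    """
    faces = set()
    for simplex in top_simplices:
        for k in range(1, len(simplex) + 1):
            faces.update(combinations(simplex, k))
    return frozenset(faces)


def colimit(diagram):
    """Colimit in Sub(X): the union of the subschemas in the diagram."""
    return frozenset().union(*diagram)


left = subschema(("a", "b"))     # the edge from a to b
right = subschema(("b", "c"))    # the edge from b to c
union = colimit([left, right])   # two edges glued along the vertex b
```

The union contains both edges and is contained in every subschema that contains them, which is the universal property of a colimit in a partially ordered set. The same function returns the empty subschema on the empty diagram.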

Definition 4.2.2. We define a sheaf on $X$ to be a functor $\mathcal{K}\colon \mathrm{Sub}(X)^{op} \to \mathbf{Sets}$
such that, for every diagram $D\colon I \to \mathrm{Sub}(X)$, the natural map

$$\mathcal{K}(\mathrm{colim}(D)) \longrightarrow \lim(\mathcal{K}(D))$$

is an isomorphism. That is, $\mathcal{K}$ must send colimits of subschemas to the corresponding
limits of sets.

A morphism of sheaves on $X$ is a natural transformation of functors $\mathrm{Sub}(X)^{op} \to \mathbf{Sets}$.
Let $\mathrm{Shv}(X)$ denote the category of sheaves on $X$.

Remark 4.2.3. Category theory experts will recognize $\mathrm{Shv}(X)$ as the category of
sheaves on a certain Grothendieck site (the locale of subobjects of $X$). It is well
known that $\mathrm{Shv}(X)$ is therefore closed under small limits and colimits. Moreover,
there is an adjunction

$$\mathrm{Pre}(X) \rightleftarrows \mathrm{Shv}(X)$$

for which the right adjoint is the forgetful functor and the left adjoint is called
sheafification. Roughly, one sheafifies a presheaf on a schema by replacing its value
on each union of simplices by the fiber product of its values on those simplices. See
[MLM94] for details.

Example 4.2.4. For any schema $X$, there is an object $\emptyset \in \mathrm{Sub}(X)$, which is the
colimit of the empty diagram on $\mathrm{Sub}(X)$. Hence if $\mathcal{K}$ is to be a sheaf on $X$, one
must have $\mathcal{K}(\emptyset) = \{*\}$.

If $X$ is a discrete schema (see Example 4.1.8), then $X$ is the coproduct of its
0-simplices. Thus, if $\mathcal{K}$ is to be a sheaf on $X$, we must have

$$\mathcal{K}(X) = \prod_{x \in X_0} \mathcal{K}(x).$$

Example 4.2.5. Suppose that $X \in \mathbf{Sch}^\pi$ is the schema $\Delta^{(\text{'Str'},\text{'Z'})}$, which looks like
this:

    •'Str' --------- •'Z'

The category $\mathrm{Sub}(X)$ is a partially ordered set with five objects: $\emptyset$, $\bullet^{\text{'Str'}}$, $\bullet^{\text{'Z'}}$,
$(\bullet^{\text{'Str'}}, \bullet^{\text{'Z'}})$, and $X$ itself; $\mathrm{Sub}(X)$ has inclusions as morphisms.

A sheaf $\mathcal{K} \in \mathrm{Shv}(X)$ assigns a set to each of these five objects, and functions
to each inclusion. However, by Example 4.2.4 it must assign to $\emptyset$ the terminal
set, $\mathcal{K}(\emptyset) = \{*\}$, and it must assign to $(\bullet^{\text{'Str'}}, \bullet^{\text{'Z'}})$ the product $\mathcal{K}(\bullet^{\text{'Str'}}) \times \mathcal{K}(\bullet^{\text{'Z'}})$.
Thus, to specify a sheaf, we need only specify its values on $X$ and on the two vertices, and one morphism, namely

$$\mathcal{K}(X) \longrightarrow \mathcal{K}(\bullet^{\text{'Str'}}) \times \mathcal{K}(\bullet^{\text{'Z'}}).$$

For example we may choose on objects the assignments $\mathcal{K}(X) = \{4, cc, 10\}$,
$\mathcal{K}(\bullet^{\text{'Str'}}) = \{1, 2\}$, and $\mathcal{K}(\bullet^{\text{'Z'}}) = \{x, y, z\}$; this implies $\mathcal{K}((\bullet^{\text{'Str'}}, \bullet^{\text{'Z'}}))$ is isomorphic
to $\{1x, 1y, 1z, 2x, 2y, 2z\}$. Any function from $\{4, cc, 10\}$ to this six-element set, say
$4 \mapsto 1x$, $cc \mapsto 2z$, $10 \mapsto 2z$, defines the restriction maps in our sheaf $\mathcal{K}$. These
restriction maps can be thought of as "foreign keys."
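The bookkeeping in this example can be checked mechanically. The Python sketch below uses the sets chosen above; the variable names are illustrative, and keys are encoded as Python integers and strings purely for convenience.

```python
from itertools import product

K_str = {1, 2}                 # keys over the 'Str' vertex
K_z = {"x", "y", "z"}          # keys over the 'Z' vertex
K_X = {4, "cc", 10}            # keys over the whole edge X

# Sheaf condition: the value on the two-vertex subschema is the product.
K_pair = set(product(K_str, K_z))

# The one remaining choice: a function K(X) -> K('Str') x K('Z').
restriction = {4: (1, "x"), "cc": (2, "z"), 10: (2, "z")}

# Composing with the projections gives one foreign-key column per vertex.
fk_str = {row: pair[0] for row, pair in restriction.items()}
fk_z = {row: pair[1] for row, pair in restriction.items()}

# Number of different restriction functions for these three sets of keys.
number_of_choices = len(K_pair) ** len(K_X)
```

Here `fk_str` and `fk_z` play the role of two foreign-key columns on the table of rows `K_X`, pointing into the tables of keys over each vertex.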

Definition 4.2.6. Given a schema $X \in \mathbf{Sch}^\pi$, we have been working with the
category $\mathrm{Sub}(X)$ of subschemas of $X$. There is a related category, called the
category of nonempty non-degenerate simple schemas over $X$ and denoted $\mathrm{ND}(X)$,
whose objects are monomorphisms $\Delta^\sigma \hookrightarrow X$ in $\mathbf{Sch}^\pi$, where $\sigma\colon C \to \mathbf{DT}$ is a
simple schema with $C \neq \emptyset$ (see Example 4.1.5), and whose morphisms are commutative
triangles.

Every simplex in a schema has a unique underlying non-degenerate simplex (of
which it is the degeneracy), so one can define a functor $\mathrm{ND}\colon \mathbf{Sch}^\pi \to \mathbf{Cat}$.

Since every injection $\Delta^\sigma \hookrightarrow X$ is in particular a subschema, there is an obvious
functor

$$\mathrm{ND}(X) \longrightarrow \mathrm{Sub}(X).$$

This induces an adjunction $\mathrm{Pre}(\mathrm{ND}(X)) \rightleftarrows \mathrm{Pre}(\mathrm{Sub}(X))$. No nontrivial unions
exist in $\mathrm{ND}(X)$, so this adjunction becomes

$$\mathrm{Pre}(\mathrm{ND}(X)) \;\overset{L}{\underset{R}{\rightleftarrows}}\; \mathrm{Shv}(\mathrm{Sub}(X)),$$

where $\mathrm{Pre}(\mathrm{ND}(X))$ is the category of presheaves $\mathrm{ND}(X)^{op} \to \mathbf{Sets}$. See [Joh02,
C.1.4.3] for more details on this type of construction.

Proposition 4.2.7. Let $X \in \mathbf{Sch}^\pi$ be a schema, and let $\mathrm{ND}(X)$ denote the category
of non-degenerate nonempty simple schema over $X$. The adjunction

$$\mathrm{Pre}(\mathrm{ND}(X)) \;\overset{L}{\underset{R}{\rightleftarrows}}\; \mathrm{Shv}(\mathrm{Sub}(X))$$

is an equivalence of categories.

Proof. It is an easy exercise to show that the composition $L \circ R$ is equal to the
identity on $\mathrm{Pre}(\mathrm{ND}(X))$ and that $R \circ L$ is canonically isomorphic to the identity
on $\mathrm{Shv}(\mathrm{Sub}(X))$.

□ 

Proposition 4.2.7 says that one does not have to worry about sheaves: the category
$\mathrm{Shv}(X)$ is equivalent to a category of functors (without "sheaf" requirements).

Lemma 4.2.8. Let $\pi\colon U \to \mathbf{DT}$ denote a type specification and let $f\colon X \to Y$
denote a morphism of schema on $\pi$. There is an adjunction

$$f^* \colon \mathrm{Shv}(\mathrm{Sub}(Y)) \rightleftarrows \mathrm{Shv}(\mathrm{Sub}(X)) \colon f_*$$

defined as follows for sheaves $\mathcal{K}_X \in \mathrm{Shv}(\mathrm{Sub}(X))$ and $\mathcal{K}_Y \in \mathrm{Shv}(\mathrm{Sub}(Y))$. For
any $U \in \mathrm{Sub}(X)$ we take

$$f^*\mathcal{K}_Y(U) := \mathcal{K}_Y(f(U)),$$

where $f(U) \in \mathrm{Sub}(Y)$ is the image of $U$ in $Y$. For any $V \in \mathrm{Sub}(Y)$ we take

$$f_*\mathcal{K}_X(V) := \mathcal{K}_X(f^{-1}(V)),$$

where $f^{-1}(V)$ is the preimage of $V$ in $X$.

Proof. Colimits of presheaves are computed objectwise, and it follows from Proposition
4.2.7 that the functor $f^*$, defined above, preserves colimits. Hence, it suffices
to show that for any representable sheaf $rY' = \mathrm{Hom}_{\mathrm{Sub}(Y)}(-, Y') \in \mathrm{Shv}(\mathrm{Sub}(Y))$
and sheaf $\mathcal{T} \in \mathrm{Shv}(\mathrm{Sub}(X))$, one has an isomorphism

$$\mathrm{Hom}(f^*(rY'), \mathcal{T}) \cong \mathrm{Hom}(rY', f_*\mathcal{T}).$$

To begin, note that for any $U \in \mathrm{Sub}(X)$ one has a chain of natural isomorphisms

$$f^*(rY')(U) := (rY')(f(U)) = \mathrm{Hom}_{\mathrm{Sub}(Y)}(f(U), Y')
\cong \mathrm{Hom}_{\mathrm{Sub}(X)}(U, f^{-1}(Y')) = r(f^{-1}(Y'))(U).$$

That is, $f^*(rY') = r(f^{-1}(Y'))$. By another chain of natural isomorphisms, we have

$$\mathrm{Hom}(f^*(rY'), \mathcal{T}) \cong \mathrm{Hom}(r(f^{-1}(Y')), \mathcal{T})
\cong \mathcal{T}(f^{-1}(Y'))
=: f_*\mathcal{T}(Y') \cong \mathrm{Hom}(rY', f_*\mathcal{T}).$$

This proves the lemma.

□ 

4.3. Simplicial databases. We think of a schema as a way of organizing the data 
in a database. Before we say what a database is, let us give one more example of a 
schema. In some sense it will be the fundamental example of a schema; however, it 
should not really be thought of as a way to organize the data, but as the meaning 
of the data itself. 

Example 4.3.1. Let $\pi\colon U \to \mathbf{DT}$ denote a type specification, and let $\mathcal{S} = \mathcal{S}^\pi$ denote
the category of simple schema on $\pi$. Let $\Gamma^\pi\colon \mathcal{S}^{op} \to \mathbf{Sets}$ denote the functor which
assigns to a schema $\sigma\colon C \to \mathbf{DT}$ the set $\Gamma^\pi(\sigma)$ of records on $\sigma$ (see Definition 2.3.1).
By Lemma 2.3.7 a map $\sigma \to \sigma'$ induces a function $\Gamma^\pi(\sigma') \to \Gamma^\pi(\sigma)$, so $\Gamma^\pi$ is
indeed a contravariant functor. By definition we can consider $\Gamma^\pi$ as a schema on $\pi$
and write $\Gamma^\pi \in \mathbf{Sch}^\pi$.

We call $\Gamma^\pi$ the universal record on $\pi$, for reasons which will be clear soon. If the
type specification $\pi\colon U \to \mathbf{DT}$ is obvious from context, we may denote $\Gamma^\pi$ simply
by $\Gamma$.

Definition 4.3.2. Let $\pi\colon U \to \mathbf{DT}$ denote a type specification, let $\Gamma^\pi$ denote the
universal record on $\pi$, and let $X \in \mathbf{Sch}^\pi$ denote a schema on $\pi$. The universal sheaf
on $X$ of type $\pi$ is the sheaf $\mathcal{U}^\pi_X$ whose value on a subschema $X' \subseteq X$ is the set

$$\mathcal{U}^\pi_X(X') = \mathrm{Hom}_{\mathbf{Sch}^\pi}(X', \Gamma^\pi).$$

Each element of $\mathcal{U}^\pi_X(X')$ is called a record on $X'$ of type $\pi$. If $\pi$ is clear from context,
we may write $\mathcal{U}_X$ to denote $\mathcal{U}^\pi_X$.

Now let $X, Y \in \mathbf{Sch}^\pi$ be schema and let $\mathcal{U}_X$ and $\mathcal{U}_Y$ denote the universal sheaf
of type $\pi$ on $X$ and $Y$, respectively. A map of schema $f\colon Y \to X$ induces a
morphism $\mathcal{U}_f\colon f^*\mathcal{U}_X \to \mathcal{U}_Y$ as follows. Let $Y' \subseteq Y$ denote an object in $\mathrm{Sub}(Y)$;
then composing with $f$ induces a natural map

$$f^*\mathcal{U}_X(Y') = \mathrm{Hom}_{\mathbf{Sch}^\pi}(f(Y'), \Gamma^\pi) \longrightarrow \mathrm{Hom}_{\mathbf{Sch}^\pi}(Y', \Gamma^\pi) = \mathcal{U}_Y(Y'),$$

which we denote $\mathcal{U}_f$; it is similarly defined on morphisms.

Definition 4.3.3. Let $\pi\colon U \to \mathbf{DT}$ denote a type specification. A simplicial
database (or simply database) of type $\pi$ is a triple $(X, \mathcal{K}, \tau)$ where $X \in \mathbf{Sch}^\pi$ is a
schema of type $\pi$, $\mathcal{K} \in \mathrm{Shv}(X)$ is a sheaf of sets on $\mathrm{Sub}(X)$, and $\tau\colon \mathcal{K} \to \mathcal{U}_X$ is a
morphism of sheaves on $X$ (see Definition 4.3.2). We refer to $X$ as the schema, $\mathcal{K}$
as the sheaf of keys, and $\tau$ as the data of the database $(X, \mathcal{K}, \tau)$.

Remark 4.3.4. Given a set of ways to measure objects, it often happens that we 
have several objects with the same measurements. For example, we may have 
three green apples, or two 1999 Toyota Corollas. In relational databases, if two 
objects have the same attributes, then they are taken to be the same instance. 
To keep them distinct, one introduces a unique identifier, an artificial key, which 
becomes part of the data. This causes problems with database integration, because 
the arbitrarily-chosen artificial keys in one database will generally not match with 
those in another. 

In our definition, the keys for the data are kept separate, as the sheaf of sets $\mathcal{K}$.
Different names for the keys in no way affect the data itself and therefore do not
interfere with database integration. We say more about this in Section 5.3.



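Example 4.3.5. In Example 4.2.5 we wrote down a sheaf $\mathcal{K} \in \mathrm{Shv}(X)$ on the
schema

    X =  •'Str' --------- •'Z'

and we will continue to use it in this example. To specify a database on $X$ of type
$\pi$, we must give a morphism $\tau\colon \mathcal{K} \to \mathcal{U}^\pi_X$ of sheaves on $X$.

We defined the universal sheaf $\mathcal{U}_X$ of type $\pi$ on $X$ in Definition 4.3.2. We have

$$\mathcal{U}_X(X) = \mathcal{U}_X((\bullet^{\text{'Str'}}, \bullet^{\text{'Z'}})) = \mathrm{Str} \times \mathbb{Z},$$
$$\mathcal{U}_X(\bullet^{\text{'Str'}}) = \mathrm{Str},$$
$$\mathcal{U}_X(\emptyset) = \{*\}.$$

To define a map $\tau\colon \mathcal{K} \to \mathcal{U}_X$, we must give maps

$$\tau(\bullet^{\text{'Str'}})\colon \mathcal{K}(\bullet^{\text{'Str'}}) \to \mathcal{U}_X(\bullet^{\text{'Str'}}), \qquad \tau(\bullet^{\text{'Z'}})\colon \mathcal{K}(\bullet^{\text{'Z'}}) \to \mathcal{U}_X(\bullet^{\text{'Z'}})$$

and

$$\tau(X)\colon \mathcal{K}(X) \to \mathcal{U}_X(X)$$

that compose correctly with the restriction maps. We arbitrarily assign

    τ(1) = Barack        τ(x) = 1961
    τ(2) = Michelle      τ(y) = 1946
                         τ(z) = 1964.

Now $\mathcal{K}(X) = \{4, cc, 10\}$, and the restriction map sends $4 \mapsto 1x$, $cc \mapsto 2z$, and
$10 \mapsto 2z$. This forces $\tau(4) = (\text{Barack}, 1961)$ and $\tau(cc) = \tau(10) = (\text{Michelle}, 1964)$.
The other values and restriction maps for $\mathcal{K}$ are now also forced.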
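The way the values of $\tau$ on $X$ are forced can be traced in a few lines of Python. The sketch reuses the assignments of this example; the dictionary names are illustrative and not part of the formal definition.

```python
tau_str = {1: "Barack", 2: "Michelle"}            # data over the 'Str' vertex
tau_z = {"x": 1961, "y": 1946, "z": 1964}         # data over the 'Z' vertex
restriction = {4: (1, "x"), "cc": (2, "z"), 10: (2, "z")}

# Naturality: restricting a row and then reading its data must agree with
# reading the data of the row and then projecting to each column, so the
# record attached to a row of K(X) is determined by its two foreign keys.
tau_X = {row: (tau_str[s], tau_z[z]) for row, (s, z) in restriction.items()}
```

The keys `cc` and `10` receive the same record, which illustrates that $\tau(X)$ need not be injective.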

Example 4.3.6. In Example 4.3.5 we followed the definitions very closely, perhaps
to the detriment of the big ideas. In this example, we write down how the sheaf
"looks" as a collection of tables.

Let us first change the schema $X$ very slightly, by instead using the schema
$\sigma\colon \{\text{First}, \text{BYear}\} \to \mathbf{DT}$, where $\sigma(\text{First}) = \text{'Str'}$ and $\sigma(\text{BYear}) = \text{'Z'}$, and now
taking $X = \Delta^\sigma$. The only difference is that we have labeled our columns by more
specific attribute names. We write $\tau(X)\colon \mathcal{K}(X) \to \mathcal{U}_X(X)$ as the table

    τ(X) =      K(X)    First       BYear
                4       Barack      1961
                cc      Michelle    1964
                10      Michelle    1964

We write $\tau(\bullet^{\text{First}})$ and $\tau(\bullet^{\text{BYear}})$ as the tables

    τ(•First) = K(•First)   First
                1           Barack
                2           Michelle

    τ(•BYear) = K(•BYear)   BYear
                x           1961
                y           1946
                z           1964

We can consider the restriction maps $\mathcal{K}(X) \to \mathcal{K}(\bullet^{\text{First}})$ and $\mathcal{K}(X) \to \mathcal{K}(\bullet^{\text{BYear}})$
as foreign keys attached to the $\tau(X)$ table. The way things are set up, this foreign
key information is kept in the restriction maps of the sheaf $\mathcal{K}$. See Example 4.2.5.

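Definition 4.3.7. Let $\pi\colon U \to \mathbf{DT}$ denote a type specification, let $\mathcal{X} = (X, \mathcal{K}_X, \tau_X)$
and $\mathcal{Y} = (Y, \mathcal{K}_Y, \tau_Y)$ denote databases of type $\pi$, and let $\mathcal{U}_X$ and $\mathcal{U}_Y$ denote the
universal sheaf on $X$ and $Y$ (see Definition 4.3.2). A morphism of databases, denoted

$$(f, f^\sharp)\colon \mathcal{X} \to \mathcal{Y},$$

consists of a map $f\colon Y \to X$ of schema (see Definition 4.1.2) and a morphism of
sheaves $f^\sharp\colon f^*\mathcal{K}_X \to \mathcal{K}_Y$ on $Y$ such that the diagram of sheaves

$$\begin{array}{ccc}
f^*\mathcal{K}_X & \xrightarrow{\;f^*\tau_X\;} & f^*\mathcal{U}_X \\
{\scriptstyle f^\sharp}\downarrow & & \downarrow{\scriptstyle \mathcal{U}_f} \\
\mathcal{K}_Y & \xrightarrow{\;\;\tau_Y\;\;} & \mathcal{U}_Y
\end{array} \tag{4}$$

commutes.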

The category of simplicial databases on $\pi$, whose objects are simplicial databases
as defined in Definition 4.3.3 and whose morphisms have just been defined, is denoted
$\mathbf{DB}^\pi$, or simply $\mathbf{DB}$ if $\pi$ is understood. Fixing a schema $X$, the category of
databases on $X$, denoted $\mathbf{DB}_X$, is the category whose objects are databases with
schema $X$ and whose morphisms are the identity on $X$.

Remark 4.3.8. A database is roughly a bunch of tables glued together by foreign
key mappings. A morphism of databases is a way to coherently assign to each table
in one database, a table in another database, and a morphism between the two
tables. Recall that a morphism of tables is a "data-preserving map" (see Definition
2.3.8, Example 2.3.9, and Remark 2.3.10). Thus, a morphism of databases should
be thought of as a coherent system of data-preserving maps.

We might make the following definition. A morphism without integrity is a pair
$(f, f^\sharp)\colon \mathcal{X} \to \mathcal{Y}$ as above, but without the requirement that Diagram (4) commute.

Remark 4.3.9. Let $Y$ be a schema and let $\mathcal{U}_Y$ denote the universal database on $Y$.
One can identify $\mathbf{DB}_Y$ with the category $\mathrm{Shv}(Y)_{/\mathcal{U}_Y}$ of sheaves over $\mathcal{U}_Y$. Explicitly,
this is the category whose objects are arrows $\mathcal{K} \to \mathcal{U}_Y$ and whose morphisms are
commutative triangles.

4.4. Relational simplicial databases. In this subsection, we present a cate- 
gory of relational databases as a full subcategory of the category DB of simplicial 
databases. We also give an adjunction which allows one to convert a database in 
our sense to a relational database in a functorial way. 

Definition 4.4.1. Let $\pi$ denote a type specification. A simplicial database
$\mathcal{X} = (X, \mathcal{K}, \tau)$ on $\pi$ is called relational if $\tau\colon \mathcal{K} \to \mathcal{U}_X$ is a monomorphism of sheaves. The
category of relational simplicial databases, denoted $\mathbf{Rel}^\pi$, is the full subcategory of
$\mathbf{DB}^\pi$ spanned by the relational simplicial databases.

Note the precise similarity of this definition with Definition 2.5.1: the schema
$X$ is a gluing together of simple schema $\sigma$, the sheaf $\mathcal{U}_X$ evaluated on a simplex
$\Delta^\sigma \subseteq X$ is $\Gamma(\sigma)$, and a monomorphism of sheaves is a morphism which restricts to
an injective function on each simplex.

Every function $f\colon A \to B$ between sets has an image $\mathrm{im}(f) \subseteq B$ and an injection
$f_m\colon \mathrm{im}(f) \to B$; similarly, given a schema $X$, every morphism $f\colon A \to B$ of sheaves
of sets on $X$ has an image sheaf denoted $\mathrm{im}(f) \subseteq B$ and a monomorphism of sheaves
$f_m\colon \mathrm{im}(f) \to B$. If $\mathcal{X} = (X, \mathcal{K}, \tau)$ is a database, we can take the image sheaf $\mathrm{im}(\tau)$
of $\tau\colon \mathcal{K} \to \mathcal{U}_X$, and the database $(X, \mathrm{im}(\tau), \tau_m)$ will be a relational simplicial
database.

Lemma 4.4.2. Let $\pi$ denote a type specification. There is an adjunction

$$\mathbf{DB}^\pi \rightleftarrows \mathbf{Rel}^\pi$$

in which the left adjoint is given by $(X, \mathcal{K}, \tau) \mapsto (X, \mathrm{im}(\tau), \tau_m)$ and the right adjoint
is the forgetful functor which realizes a relational simplicial database as a simplicial
database.

Proof. This is a simple exercise that reduces to the fact that the image functor,
which sends the category of sets and functions to the category of sets and injections,
is a left adjoint to the forgetful functor.

□ 
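On a single simplex, the left adjoint amounts to the image factorization of a function. The Python sketch below illustrates this for one table only; it makes no attempt to model sheaves or restriction maps, and the function name `relational_reflection` and the sample rows (borrowed from Example 4.3.5) are illustrative assumptions.

```python
def relational_reflection(tau):
    """Image factorization of a table tau: keys -> records.

    Returns (im_tau, unit): im_tau is the injective table whose keys are
    the distinct records, and unit sends each original key to its image.
    """
    im_tau = {record: record for record in set(tau.values())}
    unit = dict(tau)
    return im_tau, unit


tau_X = {4: ("Barack", 1961), "cc": ("Michelle", 1964), 10: ("Michelle", 1964)}
im_tau, unit = relational_reflection(tau_X)

# Applying the construction to an already injective table changes nothing
# up to renaming of keys: the number of rows stays the same.
im_again, _ = relational_reflection(im_tau)
```

The three rows of `tau_X` collapse to two rows in `im_tau`, and `tau_X` factors as `unit` followed by `im_tau`. The two original keys `cc` and `10` can no longer be told apart afterwards.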

Since the forgetful functor $\mathbf{Rel}^\pi \to \mathbf{DB}^\pi$ is fully faithful, the counit of the
adjunction in Lemma 4.4.2 is the identity on $\mathbf{Rel}^\pi$. Another way to say
this is that one does not lose information when considering a relational database
as a simplicial database, but one often does lose information when converting a
simplicial database to a relational database. Strictly "more information" can be
contained in a simplicial database than in a relational database.

4.5. Tables vs. simplicial databases. In this last subsection we present the
functor $F\colon \mathbf{Tables} \to \mathbf{DB}$ which realizes a table as a simplicial database. We will
also present the "global table" construction, which roughly takes a database and
joins everything together to make one big (unnormalized!) table.

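Construction 4.5.1. Let $\pi\colon U \to \mathbf{DT}$ denote a type specification and $(K, C, \sigma, \tau)$ a
table on $\pi$ (see Definition 2.3.3). Let $X = \Delta^\sigma \in \mathbf{Sch}^\pi$ be the associated schema,
let $\mathcal{U}_X$ denote the universal database on $X$, and let $\mathcal{K}_X$ denote the constant sheaf
on $\mathrm{Sub}(X)$ which takes each subschema to the set $K$. Define $\tau_X\colon \mathcal{K}_X \to \mathcal{U}_X$ in the
unique way such that $\tau_X(X)\colon \mathcal{K}_X(X) \to \mathcal{U}_X(X)$ is the function $\tau\colon K \to \Gamma(\sigma)$. We
are ready to assign

$$F((K, C, \sigma, \tau)) := (X, \mathcal{K}_X, \tau_X).$$

Given a map of tables $\varphi\colon (K_1, C_1, \sigma_1, \tau_1) \to (K_2, C_2, \sigma_2, \tau_2)$, we will now show
that there is a canonical map of simplicial databases $(X_1, \mathcal{K}_1, \tau_1) \to (X_2, \mathcal{K}_2, \tau_2)$.

Recall from Definition 2.3.8 that $\varphi = (g, f)$ where $g\colon K_1 \to K_2$ is a function and
$f\colon \sigma_2 \to \sigma_1$ is a morphism of simple schema such that the diagram of that definition, rewritten for
the reader's convenience here:

$$\begin{array}{ccc}
K_1 & \xrightarrow{\;\tau_1\;} & \Gamma(\sigma_1) \\
{\scriptstyle g}\downarrow & & \downarrow{\scriptstyle f^*} \\
K_2 & \xrightarrow{\;\tau_2\;} & \Gamma(\sigma_2),
\end{array}$$

commutes.

The morphism $f\colon \sigma_2 \to \sigma_1$ of simple schema induces a morphism $\Delta^{\sigma_2} \to \Delta^{\sigma_1}$
of schema, i.e. a map $f\colon X_2 \to X_1$. The sheaf $f^*\mathcal{K}_1$ on $X_2$ is the constant sheaf
with value $K_1$, so $g$ gives a map $f^\sharp\colon f^*\mathcal{K}_1 \to \mathcal{K}_2$. We will skip some details, but
one can easily show that the commutativity of Diagram (4) is equivalent to the
commutativity of the diagram displayed above, completing the construction.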

We can also extract a single table from a simplicial database, by looking at its
global sections. This requires a functor called $f_+$, defined in Section 5.1. We include
the construction here, rather than later, in order to keep like topics together, and
conclude nicely with Remark 4.5.3.

Construction 4.5.2. Let $\mathcal{X} = (X, \mathcal{K}, \tau)$ denote a simplicial database. Recall from
Remark 4.1.4 that there is an induced classification map $s\colon X_0 \to \mathbf{DT}$. Assuming
that $X$ has finitely many vertices, we can construct a table whose simple schema is
$s$.

To do so, we need only note that there is a unique map of schema $f\colon X \to \Delta^s$.
Indeed, given any simplex in $X$, its set of vertices classifies a unique simplex in $\Delta^s$;
this defines $f$. If we write $K = \mathcal{K}(X) = f_+\mathcal{K}(\Delta^s)$ and $t = f_+\tau_X(\Delta^s)\colon K \to \Gamma(s)$,
then we are ready to construct the table

$$(K, X_0, s, t) \in \mathbf{Tables}.$$

Its columns are given by the vertices $X_0$ of $X$; its rows are difficult to describe in
general, but in specific cases are quite sensible.

Remark 4.5.3. It is not hard to show that the two above constructions establish an
adjunction

$$\mathbf{Tables} \rightleftarrows \mathbf{DB}.$$

Given a database $\mathcal{X}$, the table obtained by the right adjoint will be called the global
table on $\mathcal{X}$.

5. Constructions and formal properties of Simplicial Databases 

The point of the formalism in Section 4 is to find a language in which to describe
databases such that the typical operations performed when working with databases
are sensible in the language. In other words, queries of databases should make sense
as categorical constructions, as they did in Section 3 for tables.

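5.1. Changing the schema. Let us begin with some ways that one can import
data from one schema into another. In Lemma 4.2.8 we discussed the adjunction

$$f^* \colon \mathrm{Shv}(\mathrm{Sub}(X)) \rightleftarrows \mathrm{Shv}(\mathrm{Sub}(Y)) \colon f_* \tag{5}$$

induced by a map of schema $f\colon Y \to X$. Given a database $\mathcal{X} = (X, \mathcal{K}_X, \tau_X)$ on $X$
there is an induced database $(Y, f^*\mathcal{K}_X, \mathcal{U}_f \circ (f^*\tau_X))$, denoted $f^*\mathcal{X}$; see Definition
4.3.2 and refer to the diagram

$$f^*\mathcal{K}_X \xrightarrow{\;f^*\tau_X\;} f^*\mathcal{U}_X \xrightarrow{\;\mathcal{U}_f\;} \mathcal{U}_Y.$$

A slightly more complicated construction creates a database on $X$ from a database
$\mathcal{Y} = (Y, \mathcal{K}_Y, \tau_Y)$ on $Y$ and a map of schema $f\colon Y \to X$. By the adjunction
(5), we have the diagram

$$\begin{array}{ccc}
 & & \mathcal{U}_X \\
 & & \downarrow \\
f_*\mathcal{K}_Y & \xrightarrow{\;f_*\tau_Y\;} & f_*\mathcal{U}_Y,
\end{array} \tag{6}$$

but since there is no canonical map $f_*\mathcal{K}_Y \to \mathcal{U}_X$, we have not yet constructed a
database on $X$.

To do so, let $f_+(\mathcal{K}_Y)$ denote the limit of Diagram (6). This sheaf comes with a
canonical map to $\mathcal{U}_X$, which we denote $f_+\tau_Y\colon f_+\mathcal{K}_Y \to \mathcal{U}_X$. The triple

$$(X, f_+\mathcal{K}_Y, f_+\tau_Y)$$

is a database on $X$, which we denote $f_+\mathcal{Y}$.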

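Proposition 5.1.1. Let $\pi$ denote a type specification, and let $f\colon Y \to X$ be a
morphism of schema of type $\pi$. The functors $f^*$ and $f_+$ define an adjunction

$$f^* \colon \mathbf{DB}_X \rightleftarrows \mathbf{DB}_Y \colon f_+.$$

Proof. Let $\mathcal{X} = (X, \mathcal{K}_X, \tau_X)$ and $\mathcal{Y} = (Y, \mathcal{K}_Y, \tau_Y)$ be databases. Giving a morphism
$f^*\mathcal{X} \to \mathcal{Y}$ of databases over $Y$ amounts to giving a map $a$ of sheaves making the
diagram

$$\begin{array}{ccc}
f^*\mathcal{K}_X & \xrightarrow{\;f^*\tau_X\;} & f^*\mathcal{U}_X \\
{\scriptstyle a}\downarrow & & \downarrow{\scriptstyle \mathcal{U}_f} \\
\mathcal{K}_Y & \xrightarrow{\;\;\tau_Y\;\;} & \mathcal{U}_Y
\end{array}$$

commute. By the adjunction (5) this diagram is equivalent to the diagram

$$\begin{array}{ccc}
\mathcal{K}_X & \xrightarrow{\;\;\tau_X\;\;} & \mathcal{U}_X \\
{\scriptstyle a}\downarrow & & \downarrow \\
f_*\mathcal{K}_Y & \xrightarrow{\;f_*\tau_Y\;} & f_*\mathcal{U}_Y,
\end{array}$$

by Lemma 4.2.8. Supplying a morphism $a$ making this diagram commute is equivalent
to supplying a morphism $\mathcal{K}_X \to f_+\mathcal{K}_Y$ over $\mathcal{U}_X$, because $f_+\mathcal{K}_Y$ is the limit
of Diagram (6). The proof now follows from Remark 4.3.9.

□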



Definition 5.1.2. Let $\pi$ denote a type specification, and let $f\colon Y \to X$ be a
morphism of schema of type $\pi$. The functor $f^*\colon \mathbf{DB}_X \to \mathbf{DB}_Y$, defined above, is
called the pullback functor, and the functor $f_+\colon \mathbf{DB}_Y \to \mathbf{DB}_X$, defined above, is
called the push-forward functor.

Given a sheaf of sets $\mathcal{K}_X$ on $X$, we also refer to $f^*\mathcal{K}_X \in \mathrm{Shv}(Y)$ as the pullback
of $\mathcal{K}_X$, and given a sheaf of sets $\mathcal{K}_Y$ on $Y$, we also refer to $f_+\mathcal{K}_Y \in \mathrm{Shv}(X)$ as the
push-forward of $\mathcal{K}_Y$.

Example 5.1.3. Let $X$ and $Y$ be the schema

    X =  •'Str' --------- •'Z'          Y =  •'Str' --------- •'Str'

and let $f\colon Y \to X$ be the unique morphism of schema between them.

By Remark 4.3.9, a database on $X$ is given by a morphism of sheaves $\tau_X\colon \mathcal{K}_X \to \mathcal{U}_X$,
for some sheaf of sets $\mathcal{K}_X$. We roughly think of it as a table of strings and
integers, with some values not filled in. (In fact, $\tau_X$ has more information because,
for example, two keys in $\mathcal{K}_X(X)$ might be sent to the same key in $\mathcal{K}_X(\bullet^{\text{'Str'}})$.)

The pullback database $f^*\tau_X\colon f^*\mathcal{K}_X \to \mathcal{U}_Y$ is degenerate in the sense that every
row has the same string repeated in two columns. In some sense, this is to be
expected.

Now suppose that $\tau_Y\colon \mathcal{K}_Y \to \mathcal{U}_Y$ is a database on $Y$. We roughly think of it as
a table whose rows are pairs of strings. The push-forward $f_+\tau_Y$ consists of three
tables: one has two columns (strings and integers) and the other two just have one
column. The one-column table of integers $f_+\tau_Y(\bullet^{\text{'Z'}})$ is empty. The one-column
table of strings $f_+\tau_Y(\bullet^{\text{'Str'}})$ consists of those strings $S$ for which there is a row in
$\tau_Y(Y)$ consisting of a repeated string $(S, S)$. Finally, the two-column table $f_+\tau_Y(X)$
consists of an element $(S, n)$ for every row $S$ in the one-column table of strings and
every integer $n \in \mathbf{Z}$.
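The degenerate shape of the pullback can be seen directly from the formula $f^*\mathcal{K}_X(U) = \mathcal{K}_X(f(U))$. The Python sketch below is a toy illustration: it assumes, as in the example, that $f$ collapses the whole edge $Y$ onto the 'Str' vertex of $X$, and it uses hypothetical sample data over that vertex. The function name `pullback_rows` is likewise an assumption.

```python
# Hypothetical sample data over the 'Str' vertex of X (keys -> strings).
tau_X_str = {1: "Barack", 2: "Michelle"}


def pullback_rows(tau_on_str_vertex):
    """Rows of the pullback database over the whole edge Y.

    f collapses Y onto the 'Str' vertex of X, so the keys over Y are the
    keys over that vertex, and each record S is read off as (S, S).
    """
    return {key: (s, s) for key, s in tau_on_str_vertex.items()}


rows_over_Y = pullback_rows(tau_X_str)
```

Every row of `rows_over_Y` repeats one string in both columns, which is the degeneracy described above. Nothing stored over the 'Z' vertex of $X$ contributes to the pullback.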

One sees by this example that if $f\colon Y \to X$ is not surjective, then the
push-forward functor $f_+$ results in huge tables. It is not meant to be implemented
as a hash table but as a theoretical construct.

Given a map of schemas $f\colon Y \to X$, there is one more important way to send a
database on $Y$ to a database on $X$, but only if $f$ is a monomorphism of schema. A
monomorphism of schema corresponds to the relationship often known as "is a", in
which every object of type $x$ "is an" object of type $y$. In this situation, there is a
functor which takes as input a database of $y$'s, and produces as output a database
of $x$'s with all of the $y$-information filled in, but nothing else. The functor that
accomplishes this task is denoted $f_!\colon \mathbf{DB}_Y \to \mathbf{DB}_X$ and is called "extension by
0," meaning that on every simplex in $X$ that is not in $f(Y)$, the value of the sheaf
there is an empty table.

To define $f_!$ rigorously, we first notice that $f^*\colon \mathrm{Shv}(X) \to \mathrm{Shv}(Y)$ not only has
a right adjoint ($f_*$), but a left adjoint as well, which we also denote $f_!\colon \mathrm{Shv}(Y) \to \mathrm{Shv}(X)$.
If $f$ is a monomorphism, then every subschema $Y' \subseteq Y$ is sent to a
subschema $f(Y') \subseteq X$.

Let us define $f_!\mathcal{U}_Y$ and its canonical map to $\mathcal{U}_X$. Every subschema $X' \subseteq X$ is
either of the form $X' = f(Y')$ or not. If so, we set $f_!\mathcal{U}_Y(X') = \mathcal{U}_Y(Y') = \mathcal{U}_X(X')$.
If not, we set $f_!\mathcal{U}_Y(X') = \emptyset$. There is a canonical map $a_f\colon f_!\mathcal{U}_Y \to \mathcal{U}_X$ which is the
identity map on $X' = f(Y')$ and which is $\emptyset \to \mathcal{U}_X(X')$ when $X' \notin \mathrm{im}(f)$.

Now that we have a canonical map $a_f\colon f_!\mathcal{U}_Y \to \mathcal{U}_X$ in the case that $f\colon Y \to X$
is an inclusion, we can define $f_!\colon \mathbf{DB}_Y \to \mathbf{DB}_X$ to be given by

$$f_!(Y, \mathcal{K}_Y, \tau_Y) := (X, f_!\mathcal{K}_Y, a_f \circ f_!\tau_Y).$$

The functor $f_!$ is left adjoint to the functor $f^*\colon \mathbf{DB}_X \to \mathbf{DB}_Y$ (but $f_!$ is defined
only when $f\colon Y \to X$ is an injection).
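Extension by 0 is easy to imitate on finite data. In the Python sketch below, each subschema is named by a string and carries a table (a dictionary from keys to records). This encoding, the function name `extend_by_zero`, and the sample data are illustrative assumptions; the empty subschema, whose value is always a single point, is left out.

```python
def extend_by_zero(tables_on_Y, image_of, subschemas_of_X):
    """Tables of f_! applied to a database on Y, for an inclusion f: Y -> X.

    tables_on_Y     : dict, subschema of Y -> table (dict key -> record)
    image_of        : dict, subschema of Y -> its image f(Y') in X
    subschemas_of_X : iterable of all (non-empty) subschemas of X
    """
    preimage = {x_sub: y_sub for y_sub, x_sub in image_of.items()}
    return {
        x_sub: dict(tables_on_Y[preimage[x_sub]]) if x_sub in preimage else {}
        for x_sub in subschemas_of_X
    }


# Y is the single vertex 'Str'; X is the edge from 'Str' to 'Z'.
tables_on_Y = {"str": {1: "Barack", 2: "Michelle"}}
image_of = {"str": "str"}
subschemas_of_X = ["str", "z", "str+z", "edge"]

extended = extend_by_zero(tables_on_Y, image_of, subschemas_of_X)
```

The table over the 'Str' vertex is copied unchanged, and every subschema of $X$ outside the image of $f$ receives an empty table.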

5.2. Nulls. Nulls do not conform with the mathematical logic that underlies the
strict theoretical foundation of relational databases. They are easy enough to deal
with, however, by use of foreign keys. That is, for each column $c \in C$ of a schema
$\sigma\colon C \to \mathbf{DT}$ for which a table may contain a null, one creates a new schema $\sigma'$
on columns $C' = C - \{c\}$. By an easy use of foreign keys, one considers objects
classified by $\sigma$ to be also classified by $\sigma'$. This is a way to get around the problem
of nulls. Other approaches can be found in [JR03].

The same technique is applied (automatically) in simplicial databases. Over a
simplex $\Delta^\sigma$, one puts objects for which the value on each column is known. If the
value on some set of columns is unknown for a certain object, it is represented as
a record on the subsimplex for which it is total.

If one so desired, he or she could implement simplicial databases so that local 
sections of the database (records over subschema) appeared as global sections of the 
database (records over the whole schema) by putting the value "Null" in appropriate 
places. From our perspective it is preferable just to leave local data as local data 
and not try to promote it to global data, at least for theoretical purposes. 
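The optional promotion of local records to global ones can be sketched as follows. This Python fragment is a hypothetical illustration, not a proposed implementation: local data is stored per set of known columns, and `promote_to_global` pads the missing columns with `None`, which stands in for the value "Null".

```python
def promote_to_global(local_tables, columns):
    """Pad records over sub-simplices with None so they span all columns.

    local_tables : dict, tuple of known column names -> list of records
    columns      : tuple naming every column of the full schema
    """
    rows = []
    for known_columns, records in local_tables.items():
        for record in records:
            rows.append(
                {c: record[c] if c in known_columns else None for c in columns}
            )
    return rows


local_tables = {
    ("First", "BYear"): [{"First": "Barack", "BYear": 1961}],
    ("First",): [{"First": "Michelle"}],   # birth year unknown
}
global_rows = promote_to_global(local_tables, ("First", "BYear"))
```

The dictionary `local_tables` is the representation preferred in the text, since each record lives over exactly the columns on which it is total. The list `global_rows` is the null-padded view that a user interface might choose to display.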

5.3. Duplicate records. SQL allows for a table to have the same record in two
different rows. Therefore, tables are not relations and SQL does not strictly implement
relational databases. One could argue that SQL is "wrong" in not conforming
to the theory (see [Dat05, p. 14]), but perhaps the pure relational theory is overly
strict; this is the position we take.

Simplicial databases allow for duplicate entries. This should not be threatening
because internal keys ensure the integrity of the data. If $L = A \times B \times C$, then
relations on this simple schema are subsets $K \subseteq L$. In the theory of simplicial
databases, we allow non-injective functions $\tau\colon K \to L$, called tables.

Philosophically, we see the relational model as "confusing the object with its
attributes." A schema, or set of attributes, gives a set of ways to measure a collection
of objects. It is entirely possible that two objects in that collection could have the
same measurements according to the schema. In the relational model, these two
objects would be identified in the sense that only one row of the table would be
representing both. From then on, the database and its users have no choice but
to consider these objects to be the same.

The only alternative to this is to introduce arbitrary identifiers. These artificial
keys are not part of the data being measured about the objects. In our view, it is
best to keep these arbitrary identifiers "internal" to the database management system.
Among several advantages, the most obvious is database integration, in which
it is important to know what aspects of the data are "measured" and invariant, and
what aspects are contrived. We will say more about this in Section 6.5.3.

5.4. Limits and colimits of databases. We will see shortly that limits and 
colimits taken in the category of simplicial databases have meaning in terms of the 
general theory of databases, such as joins and unions. 

Theorem 5.4.1. Let n: U — > DT denote a type specification. The category DB X 
of databases of type it is closed under taking small colimits and small limits. 

Proof. Let / denote a small category and let X: I — > DB denote an /-shaped 
diagram in DB = DB^. There is a functor DB — > Sch op taking a database 
(A, ICa,ta) to its underlying schema A, and composing this functor with X gives 
a functor which we denote X: I — ► Sch op . For an object i E I, we denote the 
database X(i) by Xi and write 

Xi = (Xi, Ki, Ti). 

To define the colimit (respectively limit) of the diagram X , we must first specify 
its schema. Since Sch = Pre(«S), where S is the category of simple schema (sec 



Definition 2.2.61, it is closed under colimits and limits f |MLM94l p. 22]); hence so 
is Sch op . Let C = colim(X) (resp. L — lim(A)) denote the colimit (resp. limit) of 
the diagram X : I — ► Sch op . Let Uc and Ul denote the universal databases on C 
and L, respectively. 

As a colimit in Sch op , the schema C comes equipped with morphisms in c, : C — > 
Xi in Sch, for each i E I, making the appropriate diagrams commute. There is 
a pullback sheaf c*t: c*K,i — > Uc- If / : i — > j is a morphism in /, then the map 
Xj — * Xi in Sch induces a morphism 

of pullback sheaves over Uc on C. Let c* : / — > Shv(C) /u denote the /-shaped 
diagram of these pullback sheaves over Uc- Define tc : JCc — > Uc to be the colimit 
of this diagram. Then the database 

C = {C,Kc,T C ) 

is our candidate for the colimit of the diagram X . It is a matter of tracing through 
the construction to show that C has the necessary universal property. 

Defining the limit of $\mathcal{X}$ is similar. As a limit in $\mathbf{Sch}^{op}$, the schema $L$ comes
equipped with morphisms $\ell_i: X_i \to L$ in $\mathbf{Sch}$, for each $i \in I$, making the appropriate
diagrams commute. There is a push-forward sheaf $(\ell_i)_+\mathcal{K}_i$ on $L$, which comes
equipped with a map $(\ell_i)_+\tau_i: (\ell_i)_+\mathcal{K}_i \to \mathcal{U}_L$. If $f: i \to j$ is a morphism in $I$, then
the map $X_j \to X_i$ in $\mathbf{Sch}$ induces a morphism of push-forward sheaves over $\mathcal{U}_L$ on $L$.
Let $(\ell_+): I \to \mathrm{Shv}(L)_{/\mathcal{U}_L}$ denote the $I$-shaped diagram of these push-forward sheaves
over $\mathcal{U}_L$. Define $\tau_L: \mathcal{K}_L \to \mathcal{U}_L$ to be the limit of this diagram. Then the database

$$\mathcal{L} = (L, \mathcal{K}_L, \tau_L)$$

is our candidate for the limit of the diagram $\mathcal{X}$. Again, it is a matter of tracing
through the construction to show that $\mathcal{L}$ has the necessary universal property.
This completes the proof.

□ 

Remark 5.4.2. The final object in the category $\mathbf{DB}^\pi$ of databases on $\pi: U \to \mathbf{DT}$
is the empty database (with empty schema and trivial sheaf). The initial
object $(X, \mathcal{K}, \tau)$ in $\mathbf{DB}^\pi$ has, as its schema $X$, a single $n$-simplex for every map
$\sigma: \{0, 1, \ldots, n\} \to \mathbf{DT}$; the sheaf is $\mathcal{K} = \mathcal{U}_X$, and the map $\tau: \mathcal{U}_X \to \mathcal{U}_X$ is the
identity.

If one knows the Čech nerve construction, one can realize the initial object in
those terms, by applying the Čech nerve functor to $\pi: U \to \mathbf{DT}$. See [Spi08, 3.1]
for details.

Corollary 5.4.3. Let $X \in \mathbf{Sch}$ be a schema and let $\mathbf{DB}_X$ denote the category of
databases with schema $X$ and with morphisms which restrict to the identity on $X$.
Colimits and limits exist in $\mathbf{DB}_X$; in particular $\mathbf{DB}_X$ has an initial object and a
final object.

Proof. Given a non-empty diagram which restricts to the identity on a certain
schema $X$, one sees by the construction of limits and colimits in the proof of
Theorem 5.4.1 that the limit and the colimit of that diagram will also have schema
$X$.

The limit (respectively the colimit) of the empty diagram in $\mathbf{DB}_X$, if it exists, is
the final (resp. initial) object in $\mathbf{DB}_X$; we must show it does exist. One immediately
sees that the final object is $(X, \mathcal{U}_X, \mathrm{id}_{\mathcal{U}_X})$, and the initial object is $(X, \emptyset, \emptyset \to \mathcal{U}_X)$,
where here $\emptyset$ denotes the sheaf on $X$ whose value is constantly the empty set, and
where $\emptyset \to \mathcal{U}_X$ is the unique morphism of sheaves.

□ 

5.5. Projections. This query is built into the theory of simplicial databases. Given
a database $(X, \mathcal{K}, \tau)$ and a subschema $X' \subset X$, we have the database $(X', \mathcal{K}|_{X'}, \tau|_{X'})$
given by restricting the sheaf $\mathcal{K}$ and the map of sheaves $\tau: \mathcal{K} \to \mathcal{U}$ to the subschema
$X'$. One can view it as a table using Construction 4.5.2.



5.6. Unions and insertions. Given two databases with the same schema, one 
can apply the UNION query. To do so, one keeps the same columns but takes the 
union of the rows. An insertion is a special kind of union; namely it is a union of 
two databases on the same schema, where one of the databases consists only of a 
single row. 

We have a few more options in simplicial databases than one does in relational
databases; these differences are analogous to the difference between the UNION
query and the UNION ALL query in SQL. That is, since we allow duplicate entries
(see Section 5.3), the user can decide when an object in one database is the same
as an object with the same attributes stored in another database and when it is
different. Let us make all of this precise.

We can represent unions, insertions, and more by taking colimits of various
diagrams of databases. Let $\mathcal{X} = (X, \mathcal{K}, \tau)$ denote a simplicial database, and let
$\mathcal{X}' = (X, \mathcal{K}', \tau')$ be a database with the same schema, $X$. Both receive a map from
the initial database on $X$, and the coproduct will be $(X, \mathcal{K} \amalg \mathcal{K}', \tau \amalg \tau')$, as desired.
(See the proof of Theorem 5.4.1 for details on the colimit construction.)

The above construction gives a UNION ALL query: duplicated tuples will remain 
distinct. There are two ways of having that not be the case. The first is to simply 
eliminate the duplicates by converting the database to a relational database; see 
Lemma 4.4.2. However, this may result in information loss if there really were two
entities with the same attributes, because these duplicates will be eliminated. 

The other way can occur if the user has more information about which instances
in the first database correspond to instances in the second database. This can be
accomplished by having a third database $\mathcal{X}'' = (X, \mathcal{K}'', \tau'')$ and maps from it to $\mathcal{X}$
and $\mathcal{X}'$. The colimit of this diagram, $(X, \mathcal{K} \amalg_{\mathcal{K}''} \mathcal{K}', \tau \amalg_{\tau''} \tau')$, will be the union of
the records in $\mathcal{X}$ with those in $\mathcal{X}'$, and will identify two records if they agree in $\mathcal{X}''$.
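The three flavors of union described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a table is modeled as a dictionary from internal keys to attribute tuples, the function names are hypothetical, and the maps out of the third table are assumed to be injective (a general pushout would need to merge whole equivalence classes of keys).

```python
# Hypothetical model: a table is {internal_key: attribute_tuple}.
# The keys stand in for the sheaf of keys; the tuples are the data.

def union_all(t1, t2):
    """Coproduct: tag keys by origin so no two records are identified."""
    out = {("left", k): row for k, row in t1.items()}
    out.update({("right", k): row for k, row in t2.items()})
    return out

def relational_union(t1, t2):
    """Forget the keys and keep distinct attribute tuples (SQL UNION)."""
    return set(t1.values()) | set(t2.values())

def glued_union(t1, t2, shared, to_left, to_right):
    """Pushout along a third table (assumes to_left, to_right injective):
    for each key s of `shared`, identify record to_left[s] of t1
    with record to_right[s] of t2."""
    out = union_all(t1, t2)
    for s, row in shared.items():
        # Maps of databases preserve attributes, so the images must agree.
        assert t1[to_left[s]] == t2[to_right[s]] == row
        del out[("right", to_right[s])]  # keep one representative
    return out

t1 = {1: ("Marx", "Karl"), 2: ("Smith", "Ann")}
t2 = {"a": ("Marx", "Karl"), "b": ("Jones", "Bo")}
shared = {"s": ("Marx", "Karl")}

all_rows = union_all(t1, t2)
distinct_rows = relational_union(t1, t2)
glued_rows = glued_union(t1, t2, shared, {"s": 1}, {"s": "a"})
print(len(all_rows), len(distinct_rows), len(glued_rows))
```

In this sample the glued union and the relational union happen to have the same size, but they differ in kind: the glued union identifies only the records the user has declared equal, whereas the relational union merges every pair of records with equal attributes.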

As mentioned above, inserting a row is a special case of taking the union of 
databases. 

We can take much more general colimits than those mentioned above, all of which 
were constant in the schema. These constructions appear to be new; perhaps they 
can provide useful ways to analyze and assemble data. 

5.7. Join. Two databases can be joined together by specifying a common sub- 
schema of each and "gluing together" along that sub-schema. If no common sub- 
schema is mentioned we take the initial schema, which is empty, and join along that; 
the result is called the natural join. The concept of gluing is rigorously formulated 
as taking limits of certain diagrams in Sch op ; thus the point we are making is that 
joining databases in the usual sense can be accomplished by taking limits in the 
category of simplicial databases. Let us make all of this precise. 

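Recall from Theorem 5.4.1 that the limit of the diagram of databases

$$(X_1, \mathcal{K}_1, \tau_1) \to (X, \mathcal{K}, \tau) \leftarrow (X_2, \mathcal{K}_2, \tau_2)$$

has schema $X' = X_1 \amalg_X X_2$. This induces a diagram

$$\begin{array}{ccc}
X & \to & X_1 \\
\downarrow & & \downarrow \\
X_2 & \to & X'
\end{array}$$

in $\mathbf{Sch}$. We can thus push-forward $\mathcal{K}_1$, $\mathcal{K}$, and $\mathcal{K}_2$ to $X'$ and get a diagram of
push-forward sheaves there (see Definition 5.1.2), all naturally mapping to $\mathcal{U}_{X'}$. For
typographical reasons, we leave out the fact that these are push-forwards and write
the diagram $\mathcal{K}_1 \to \mathcal{K} \leftarrow \mathcal{K}_2$ over $\mathcal{U}_{X'}$. We are ready to write the limit database as

$$(X_1 \amalg_X X_2, \; \mathcal{K}_1 \times_{\mathcal{K}} \mathcal{K}_2, \; \tau'),$$

where $\tau': \mathcal{K}_1 \times_{\mathcal{K}} \mathcal{K}_2 \to \mathcal{U}_{X'}$ is the structure map.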

Example 5.7.1. Suppose we have the two schemas pictured here:

    $X_1$ := 'First' •---• 'Last',        $X_2$ := 'LName' •---• 'BYear',

and wish to join them together by equating 'Last' with 'LName' (both of which
have the same data type, namely Str). To do so, we use the schema $X = \bullet^{\mathrm{Str}}$,
which maps to each of $X_1$ and $X_2$ in an obvious way.

Now given any databases $\mathcal{X}_1 = (X_1, \mathcal{K}_1, \tau_1)$ and $\mathcal{X}_2 = (X_2, \mathcal{K}_2, \tau_2)$ on $X_1$ and
$X_2$, we can join them by taking the limit of the solid arrow diagram

$$\begin{array}{ccc}
\mathcal{X}_1 \times_{\mathcal{X}} \mathcal{X}_2 & \dashrightarrow & \mathcal{X}_2 \\
\downarrow & & \downarrow \\
\mathcal{X}_1 & \to & \mathcal{X}
\end{array}$$

where $\mathcal{X} = (X, \mathcal{U}_X, \mathrm{id}_{\mathcal{U}_X})$ is the final database on $X$. The schema of the resulting
database is

    'First' •---• 'Last'='LName' •---• 'BYear'

This does not represent a table with three columns, but two tables, each with
two columns, and each projecting to a common 1-column table. However, its global
table does have three columns (see Remark 4.5.3). Its records are those triples
of the form (First, Last, BYear) for which there is a (First, Last) pair in $\mathcal{X}_1$ and a
(Last, BYear) pair in $\mathcal{X}_2$ with matching values of Last. This is indeed their join.

Remark 5.7.2. The "join" we are working with here could be thought of as a
combination of equi-join and outer join. Because databases are sheaves on a schema,
they do not have just one table but a system of tables, and the idea of nulls is built
into the theory (see Section 5.2).

More precisely, if $\mathcal{X}_1 \to \mathcal{X} \leftarrow \mathcal{X}_2$ is a diagram of databases, the limit $\mathcal{X}'$
represents the join of $\mathcal{X}_1$ and $\mathcal{X}_2$ along a shared set of columns (those of $\mathcal{X}$). Its schema
is roughly the union of the schemas of $\mathcal{X}_1$ and $\mathcal{X}_2$. Its global table will be the
equi-join of the global tables for $\mathcal{X}_1$ and $\mathcal{X}_2$.

The point of this remark, however, is that the new table $\mathcal{X}'$ does not only
contain global information, but local information as well. Much of the data of
$\mathcal{X}_1$ (respectively $\mathcal{X}_2$) is preserved upon passage to $\mathcal{X}'$, and that which cannot be
extended to global data could still be viewed globally if one uses Null values. It is
in this sense that colimits in DB are related to outer joins.

When joining databases together, one first chooses a set C of columns to equate. 
When two distinct objects have the same C-attributes, then the join is "lossy" in 
the sense that there will be false information in the join. To remedy this, one must 
be careful to distinguish between objects, even when considered only in terms of 
C . The following example will hopefully make this more clear. 

Example 5.7.3. Suppose one wants to join the following two tables:

    $\tau_1$ | Title | LastName
    ---------+-------+---------
    1        | Dr.   | Marx
    2        | Mr.   | Marx

    $\tau_2$ | FirstName | LastName
    ---------+-----------+---------
    A        | Karl      | Marx
    B        | Groucho   | Marx

The outcome will be the following table:

    Title | FirstName | LastName
    ------+-----------+---------
    Dr.   | Karl      | Marx
    Dr.   | Groucho   | Marx
    Mr.   | Karl      | Marx
    Mr.   | Groucho   | Marx

This table has four entries, two of which are "accurate," in that they describe real 
instances, and two of which are not. This occurs because the relational database 
cannot distinguish between the two instances of the last name Marx. 

Achieving a lossless join is easy when databases are allowed to have duplicate
entries with the same attributes. Consider the table

    $\tau$ | LastName
    -------+---------
    x      | Marx
    y      | Marx

which accepts maps from both $\tau_1$ and $\tau_2$ by sending both 1 and A to x, and sending
both 2 and B to y (see Definition 2.3.8). The limit of this diagram is the table

    Title | FirstName | LastName
    ------+-----------+---------
    Dr.   | Karl      | Marx
    Mr.   | Groucho   | Marx

as desired.
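The contrast between the lossy and the lossless join can be checked directly. The following Python sketch is an illustration only: it models a table as a dictionary from internal keys to rows, and the function names `attribute_join` and `fiber_product` are hypothetical. The first function equates rows by attribute value, as a relational join does; the second equates rows only when their internal keys map to the same key of the base table.

```python
# Hypothetical model: a table is {internal_key: {column: value}}.
t1 = {1: {"Title": "Dr.", "LastName": "Marx"},
      2: {"Title": "Mr.", "LastName": "Marx"}}
t2 = {"A": {"FirstName": "Karl", "LastName": "Marx"},
      "B": {"FirstName": "Groucho", "LastName": "Marx"}}

def attribute_join(left, right, column):
    """Relational join: pair rows whenever their values in `column` agree."""
    return [{**l, **r}
            for l in left.values() for r in right.values()
            if l[column] == r[column]]

def fiber_product(left, right, to_base_left, to_base_right):
    """Keyed join: pair rows only when they map to the same base key."""
    return [{**left[i], **right[j]}
            for i in left for j in right
            if to_base_left[i] == to_base_right[j]]

lossy = attribute_join(t1, t2, "LastName")

# Maps into the base table: 1 and A go to x, 2 and B go to y.
lossless = fiber_product(t1, t2, {1: "x", 2: "y"}, {"A": "x", "B": "y"})

print(len(lossy))
for row in lossless:
    print(row["Title"], row["FirstName"], row["LastName"])
```

The attribute join returns all four combinations, while the keyed fiber product returns only the two rows that describe real people.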

In the example above, the table $\tau$ has two instances of the same string. This is
not superfluous because there are two people named Marx. They are differentiated
by their internal keys, but not by their attributes. Keeping distinct objects distinct,
even if they have the same attributes, is very useful in practice. It not only allows
for lossless joins, but is well-suited for database integration as well.



5.8. Select. In Example 3.1.10 we selected from a table $\tau$ with columns $C =$
{'First Name', 'Last Name', 'BYear'} all instances for which the value of 'First
Name' was "Barack." This was computed as follows. First, we made a table $\tau'$
whose column set $C'$ consisted of a single element, labeled 'First Name', and filled
in $\tau'$ with a single entry, 'Barack'. We might call this table the selection table.
The SELECT operation was performed by taking the fiber product of $\tau \to \mathrm{id}_{C'} \leftarrow \tau'$,
where $\mathrm{id}_{C'}$ denotes the table of all possible values of 'First Name'.

Performing SELECT operations in a general simplicial database has the same
flavor, in that it is always computed as a certain kind of fiber product. Denote
the database from which we are selecting as $\mathcal{X} = (X, \mathcal{K}_X, \tau_X)$, let $S \subset X$ denote
a subschema and $\mathcal{S} = (S, \mathcal{K}_S, \tau_S)$ a relational table on $S$, to serve as the selection
table. That is, we will be selecting from $\mathcal{X}$ all instances that have the designated
$\mathcal{S}$-attributes. Finally, we let $1_S = (S, \mathcal{U}_S, \mathrm{id}_{\mathcal{U}_S})$ denote the final database on the
schema $S$. The fiber product $\mathcal{X}_S$ in the diagram

$$\begin{array}{ccc}
\mathcal{X}_S & \to & \mathcal{S} \\
\downarrow & & \downarrow \\
\mathcal{X} & \to & 1_S
\end{array}$$

is the desired result.
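For a single table, this fiber product reduces to a familiar filter. The sketch below is a minimal Python illustration, assuming rows are dictionaries and the selection table is a list of rows on the chosen columns; the function name `select` and the sample rows are made up for the example.

```python
def select(table, selection, columns):
    """Fiber product of `table` and `selection` over the table of all
    possible values of `columns`: keep the rows of `table` whose
    restriction to `columns` occurs in `selection`."""
    wanted = {tuple(row[c] for c in columns) for row in selection}
    return [row for row in table
            if tuple(row[c] for c in columns) in wanted]

# Illustrative sample rows (not taken from the paper).
people = [
    {"First Name": "Barack", "Last Name": "Obama", "BYear": 1961},
    {"First Name": "George", "Last Name": "Bush", "BYear": 1946},
]
selection_table = [{"First Name": "Barack"}]

selected = select(people, selection_table, ["First Name"])
print(selected)
```

The table of all possible values never has to be materialized: two rows lie over the same point of it exactly when their restrictions to the chosen columns are equal.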
5.9. Deletions. Deletion can be subtle. If one deletes entries over a subschema,
the action must "cascade" up the hierarchy, deleting entries in larger schemas when 
they refer or point to the deleted entries. To that end, we define the following 
construction. 

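Definition 5.9.1. Suppose given a schema $X$ and a subsheaf $\mathcal{K}_1 \subset \mathcal{K}$ on $X$. Let
$\overline{\mathcal{K}_1} \subset \mathcal{K}$ denote the presheaf on $X$ with

$$\overline{\mathcal{K}_1}(X') := \{ r \in \mathcal{K}(X') \mid \exists X'' \subset X', \; X'' \neq \emptyset, \; r|_{X''} \in \mathcal{K}_1(X'') \}$$

for subschema $X' \in \mathrm{Sub}(X)$. Here $r|_{X''}$ denotes the image of $r$ under the restriction
map $\mathcal{K}(X') \to \mathcal{K}(X'')$. We call $\overline{\mathcal{K}_1}$ the closure of $\mathcal{K}_1$ in $\mathcal{K}$.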

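Suppose now we want to delete all entries of a given type from a database. More
concretely, suppose $\mathcal{X} = (X, \mathcal{K}_X, \tau_X)$ is a database with schema $X$, that $i: S \subset X$
is a subschema, and that $\mathcal{S} = (S, \mathcal{K}_S, \tau_S)$ is a relational database of objects of this
subtype, all of which we would like to delete from $\mathcal{X}$. As explained in Section 5.8,
we can select the rows of $\mathcal{X}$ of the type specified by $\mathcal{S}$ by defining $\mathcal{X}_S$ to be the
limit as in the diagram

$$\begin{array}{ccc}
\mathcal{X}_S & \to & \mathcal{S} \\
\downarrow & & \downarrow \\
\mathcal{X} & \to & (S, \mathcal{U}_S, \mathrm{id}_{\mathcal{U}_S}).
\end{array}$$

We know that $\mathcal{X}_S$ has schema $X = X \amalg_S S$ and we momentarily invent notation
and write $\mathcal{X}_S = (X, \mathcal{K}_{S \subset X}, \tau_{S \subset X})$.

The map $\mathcal{X}_S \to \mathcal{X}$ defines an inclusion of sheaves $\mathcal{K}_{S \subset X} \subset \mathcal{K}_X$ on $X$, and we
take its closure $\overline{\mathcal{K}_{S \subset X}} \subset \mathcal{K}_X$. By construction we can now delete this subsheaf
objectwise on $\mathrm{Sub}(X)$. That is, we define for $X' \subset X$

$$\mathcal{K}_{X \setminus S}(X') = \mathcal{K}_X(X') \setminus \overline{\mathcal{K}_{S \subset X}}(X'),$$

where $A \setminus B$ denotes the maximal subset of $A$ which contains no elements in $B$.
The database

$$\mathcal{X}' := (X, \mathcal{K}_{X \setminus S}, \tau),$$

where $\tau$ is shorthand for $\tau|_{\mathcal{K}_{X \setminus S}}: \mathcal{K}_{X \setminus S} \to \mathcal{U}_X$, is the deletion of $\mathcal{S}$ from $\mathcal{X}$. There
is a canonical map $\mathcal{X}' \to \mathcal{X}$ in DB, and one can show that $\mathcal{X}'$ is the final object
under $\mathcal{X}$ whose join with $\mathcal{S}$ is empty.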
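The cascading behavior of the closure can be illustrated with a small Python sketch. It rests on simplifying assumptions that are not part of the construction above: a subschema is modeled as a frozenset of column names, records are identified with their attribute values (internal keys are dropped), and all function names are hypothetical.

```python
def rec(**attrs):
    """A record as a hashable set of (column, value) pairs."""
    return frozenset(attrs.items())

def restrict(record, columns):
    """Restriction map: keep only the entries lying in `columns`."""
    return frozenset((c, v) for c, v in record if c in columns)

def closure(K, K1):
    """For each subschema, the records of K having some restriction
    to a non-empty smaller subschema that lies in K1."""
    closed = {}
    for cols, records in K.items():
        closed[cols] = {
            r for r in records
            if any(restrict(r, sub) in K1[sub]
                   for sub in K1 if sub and sub <= cols)
        }
    return closed

def delete(K, K1):
    """Remove the closure of K1 from K, subschema by subschema."""
    doomed = closure(K, K1)
    return {cols: records - doomed[cols] for cols, records in K.items()}

NAME, CITY = frozenset({"Name"}), frozenset({"City"})
K = {
    NAME: {rec(Name="Ann"), rec(Name="Bob")},
    CITY: {rec(City="Oslo"), rec(City="Rome")},
    NAME | CITY: {rec(Name="Ann", City="Oslo"), rec(Name="Bob", City="Rome")},
}
K1 = {NAME: {rec(Name="Bob")}}  # entries to delete, over the subschema Name

after = delete(K, K1)
print(after[NAME | CITY])
```

Deleting Bob over the 'Name' column also removes the (Bob, Rome) record over the larger subschema, while the city Rome itself survives: the deletion cascades upward to records that refer to the deleted entry, not downward.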

6. Applications, advantages, and further research 

In this section, we discuss the applications of the category of simplicial databases.
First, simplicial databases can be used wherever relational databases are used;
though simplicial databases are more general, they are still closed under applying
the usual queries. On the other hand, there are many advantages to using simplicial
databases as opposed to relational ones.

In Section 6.1 we discuss how the geometry of a schema can provide an intuitive
picture for the content and layout of a database. As an example of using category
theory to reason about databases, we show in Section 6.2 that query equivalences
are trivially verified when one phrases them in categorical language. In Section
6.3 we discuss how diagrams of databases can give various users different privileges
in terms of accessing and modifying data. In Section 6.4 we address the issue of
comparing our categorification of databases to other versions. Finally, in Section
6.5 we discuss further research on the subject and open questions.



6.1. Geometric intuition. In Section 4.1 we defined the category $\mathbf{Sch}^\pi$ of
schemas for a given type specification $\pi$. They are based on geometric objects
called simplicial sets. In this section, we show that the geometry of these objects
is intuitive and therefore useful in practice.

Example 6.1.1. In this example, we consider a simplified situation in which one
keeps track of the cities from which airplane flights take off and those at which
they land. So suppose we have only one type, $\mathbf{DT}$ = {'City'}, and $U$ is the set of
cities in the world that have airports. Let $X$ be the schema

    'City' •---• 'City'

For our sheaf of keys $\mathcal{K}$, we take $\mathcal{K}$('City') $= U$. Over the 1-simplex $X$ take $\mathcal{K}(X)$
to be the set of pairs $(c_1, c_2)$ for which $c_1$ is the city of departure and $c_2$ is the city
of arrival for some flight. Let $\mathcal{X}$ denote this database of flights.

Now, joining this database with itself yields a database with schema

    'City' •---• 'City' •---• 'City'

whose global sections are "flights with layover," i.e. pairs of flights with the
destination city of the first flight equal to the departing city of the second flight.
Similarly, the database of multi-city trips of a given length $n$ is simply the union
(colimit) of $n$ copies of the database of flights $\mathcal{X}$ in this way.

Moreover, if we want to use $\mathcal{X}$ to find the set of available round-trips, we simply
join the ends of the schema in Diagram 6.1.1 to make a circle

    [Diagram: a circle consisting of two 'City' vertices joined by two edges]

This is not just heuristic; we have literally taken the indicated limit of databases.
The result is a new database whose global sections are precisely the pairs of flights
which constitute a round-trip.

The point is that one can intuit this result by visualizing round-trips as circles, 
and then applying that vision to the schemas themselves. 
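Both global-section computations in this example are easy to mimic on a toy data set. In the Python sketch below, the flights table is modeled as a set of (departure, arrival) pairs; the airport codes and function names are illustrative assumptions, not data from the paper.

```python
# Illustrative flights table: a set of (departure, arrival) pairs.
flights = {("BOS", "JFK"), ("JFK", "LAX"), ("LAX", "BOS"), ("JFK", "BOS")}

def layovers(flights):
    """Global sections over the path with three vertices: pairs of
    flights glued along the middle city."""
    return {(a, b, c)
            for (a, b) in flights
            for (b2, c) in flights
            if b == b2}

def round_trips(flights):
    """Global sections over the circle: a flight out together with a
    flight back between the same two cities."""
    return {(a, b) for (a, b) in flights if (b, a) in flights}

print(sorted(layovers(flights)))
print(sorted(round_trips(flights)))
```

Gluing the two end vertices of the path is what turns the layover condition into the round-trip condition: the third city is forced to equal the first.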

Example 6.1.2. In 2004, Bearman et al. [BMS04] present data showing that at
a certain high school called "Jefferson High," there is a statistically small number of
sexual couples that later switch partners. That is, if $B_1$ and $G_1$ are sexual partners
and $B_2$ and $G_2$ are sexual partners, then it rarely happens that later $B_1$ mates with
$G_2$ and $B_2$ mates with $G_1$. As they say "...we find many cycles of length 4 in the
simulated networks, but few in Jefferson..."

Suppose then that we take their raw data and put it on the schema

    [Schema diagram: a single edge with a vertex labeled 'Boyfriend']

which we denote $X$. Visually, we represent two boys and two girls who switch
partners as follows:

(7)    [Diagram: two vertices in a column labeled "Boys" and two in a column labeled
       "Girls," joined by two horizontal and two diagonal edges to form a 4-cycle]

(where, say, horizontal lines represent the original partnerships and diagonal lines
represent the new partnerships). And indeed, we can take the union of four copies
of $X$ along various vertices to obtain a database with the above 4-cycle schema.

In other words, there is a way to take raw data over a line segment, representing
partnerships, and automatically generate data over the "switch schema," Diagram
(7), just by taking the indicated limit of databases. The global sections of this
new "switched partners" database are precisely what is being studied in Bearman's
paper.

As in Example 6.1.1, the point is that the shape of the schema is intuitive.
Using schemas that are geometrically intuitive may enhance the ability of users to 
manipulate and make sense out of the raw data. 
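As a rough illustration of generating data over the "switch schema" from data over a line segment, the Python sketch below lists the 4-cycles in a set of partnerships. The names and the sample data are made up, and the brute-force search is only meant to show what the global sections are, not how a database system would compute them.

```python
from itertools import combinations

def switched_partners(partnerships):
    """Global sections over the 4-cycle schema: two boys and two girls
    such that every boy-girl pair among them is a recorded partnership."""
    boys = sorted({b for b, _ in partnerships})
    girls = sorted({g for _, g in partnerships})
    return [(b1, g1, b2, g2)
            for b1, b2 in combinations(boys, 2)
            for g1, g2 in combinations(girls, 2)
            if {(b1, g1), (b1, g2), (b2, g1), (b2, g2)} <= partnerships]

# Illustrative data: a set of (boy, girl) partnerships.
partnerships = {("B1", "G1"), ("B2", "G2"), ("B1", "G2"),
                ("B2", "G1"), ("B3", "G1")}

cycles = switched_partners(partnerships)
print(cycles)
```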

6.2. Query equivalences. It is well known that joining tables together is very 
costly. If one only wishes to consider certain rows or columns of a join, he or she 
should isolate those rows or columns before performing the join, not after. For that 
reason, one is taught to "push selects and projects," i.e. to do these operations 
first. 

How does one prove that projecting first and then joining will result in the same 
database as will joining first and then projecting? The proofs of results like these 
are generally tedious. In this section, we do not claim any new results. We merely 
show that these simple query equivalences are obvious when one uses the language 
of simplicial databases and knows basic category theory. 

For example, it is a standard category-theoretic fact that, in any category $\mathcal{C}$ with
limits, there is a natural isomorphism

(8)    $(A \times_B C) \times_D E \cong (C \times_D E) \times_B A.$

Note that both joins and selects are examples of such limits (see Sections 5.7 and
5.8). The formula (8) in particular applies to the category DB of databases and
proves that "selecting $E$ from a join of $A$ and $C$ gives the same result as first
selecting $E$ from $C$ and then joining the result with $A$."
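The isomorphism (8) can be spot-checked in the category of sets, where a fiber product is just the set of pairs with matching images. The Python sketch below does this for one small example; the data and the helper name `fiber_product` are illustrative, and a check on one instance is of course not a proof.

```python
def fiber_product(left, right, f, g):
    """Pairs (x, y) with f(x) == g(y): the pullback of sets along f and g."""
    return {(x, y) for x in left for y in right if f(x) == g(y)}

# A: (title, last name); C: (last name, first name); E: selected first names.
A = {("Dr.", "Marx"), ("Mr.", "Engels")}
C = {("Marx", "Karl"), ("Engels", "Friedrich")}
E = {"Karl"}

# Join A and C on last name, then select on first name.
join_then_select = fiber_product(
    fiber_product(A, C, lambda a: a[1], lambda c: c[0]),
    E, lambda ac: ac[1][1], lambda e: e)

# Select from C on first name, then join with A on last name.
select_then_join = fiber_product(
    fiber_product(C, E, lambda c: c[1], lambda e: e),
    A, lambda ce: ce[0][0], lambda a: a[1])

# The two results differ only by re-bracketing the triples.
lhs = {(a, c, e) for ((a, c), e) in join_then_select}
rhs = {(a, c, e) for ((c, e), a) in select_then_join}
print(lhs == rhs)
```

The second order of evaluation never builds the full join of A and C, which is the practical content of "pushing selects."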

Projecting a database to a subschema is easy to describe in the theory of
simplicial databases: one simply restricts the sheaf $\mathcal{K}$ and the map $\tau$ to that subschema
(see Section 5.5). The fact that projects commute with joins follows from basic
sheaf theory, e.g. that the limit of a diagram of sheaves is the same as the limit of
the underlying diagram of presheaves.

6.3. Privileges. The sheaf-theoretic nature of our conception of databases lends
itself nicely to the idea of privileges. It often happens that one wishes to give a
particular user the ability to modify certain sections of the database but not others.
If $X$ is the schema for a database $\mathcal{X}$, perhaps we wish to give a particular user the
ability to modify data on the subschema $i: X' \subset X$.

To accomplish this, note that there is a map of databases

$$\mathcal{X} = (X, \mathcal{K}_X, \tau_X) \to (X', i^*\mathcal{K}_X, i^*\tau_X) = \mathcal{X}'.$$

We allow the user to see $\mathcal{X}'$ as a database and make changes to it (we could also limit
the ways in which this user can modify $\mathcal{X}'$, allowing only insertions, for example).
At any given time, the user only sees the sub-database $\mathcal{X}'$.

Suppose he or she adds a few lines to the sheaf $i^*\mathcal{K}_X$ to make it $i^*\mathcal{K}_X \cup C$. To
update the main database, we take the colimit of the diagram of sheaves

$$\begin{array}{ccc}
i_! i^*\mathcal{K}_X & \to & i_!(i^*\mathcal{K}_X \cup C) \\
\downarrow & & \\
\mathcal{K}_X & &
\end{array}$$

and the result will be a new sheaf on $X$ with the appropriate insertions.

Deletions are handled in a somewhat different way, but the idea is the same. If
the user deletes data from the sheaf $i^*\mathcal{K}_X$ to obtain the sheaf $i^*\mathcal{K}_X \setminus D$, then to
update the main database may require us to delete entries from larger schemas (see
Section 5.9). The updated sheaf on $X$ will be the limit of the diagram

$$\begin{array}{ccc}
 & & \mathcal{K}_X \\
 & & \downarrow \\
i_+(i^*\mathcal{K}_X \setminus D) & \to & i_+ i^*\mathcal{K}_X.
\end{array}$$

Again, we are not claiming that privileges of this type are anything new. We are 
claiming that they are naturally phrased in this categorical language, thus bringing 
a new and powerful mathematical tool to bear on the problems of the subject. 

6.4. Comparison to other categories of databases. As mentioned in the in- 
troduction, many other categorifications of databases have been presented over the 
years. One of the nice features of category theory is that one can compare various 
categories using functors. Given another categorical formulation of databases, we 
could try to produce a functor from it to DB and from DB back to it. The way
that these functors behave (e.g. if they are adjoint, or if one or the other is fully
faithful) will tell us about the relative expressive power of the models and
help us understand how to translate between them. We hope to work on such a
comparison in the future.

6.5. Further research. The category-theoretic and also geometric nature of sim- 
plicial databases opens up many directions for future research. We present a few 
in this subsection that we intend to pursue. Many of these ideas were suggested to 
us by Paea LePendu. 

6.5.1. Topological methods. First, we would like to consider how we might use
methods from algebraic topology to study databases. Recall from Example 4.1.5 that
there is a functor $\mathbf{Sch} \to \mathbf{Top}$ called topological realization that allows one to
naturally view any schema as a topological space. Furthermore, we already saw
in Example 6.1.2 that importing topological ideas can have real world meaning:
topological 4-cycles represented pairs of mating couples that switched partners.
Another example of the usefulness of topological methods is given by "lifting
problems." Problems of this sort include the famous question "are there three foods, 
each pair of which taste good when eaten together, but the threesome of which tastes 
bad when eaten together?" 

To phrase this in terms of social networks, suppose that for any n people, either 
this group is said to be a friendship group or it is not. The above lifting problem 
becomes: "are there three people, each pair of which is a friendship group, but 
the triple is not?" These types of phenomena can be represented geometrically, so 
having simplicial sets as schema may be useful for their study. 

Homotopical methods from algebraic topology may also be useful. When one 
object "morphs" into another over the course of time (such as a child becoming an 
adult), it is difficult to know how to treat that object in a database. Homotopy 
theory is the study of gradual transformation through time, and the author sees 
some potential for using it to study real-world phenomena. 

Finally, the geometric nature of our schema may be useful for query optimiza- 
tion. Schemas can be classified according to their geometric structure. It may 
be that in performing many queries, a database management system learns that 
some geometric structures are being used more often than others. The patterns 
which emerge may be only visible when one uses schemas that have this higher 
dimensional geometric nature. 

6.5.2. Functional dependencies and normal forms. In this paper we have not dis- 
cussed functional dependencies or normal forms. It is appealing to ask the following 
question: 

Question 6.5.1. Let $X \in \mathbf{Sch}$ denote a schema; it should be thought of as having a
shape (again, via the topological realization functor $\mathbf{Sch} \to \mathbf{Top}$), namely a union
of tetrahedra. We wonder:

(1) Given a set of functional dependencies, is there a natural way to annotate 
the shape X so that these dependencies are made visual? 

(2) Given a schema X that has been annotated in this way, can one easily 
determine whether it is in a certain normal form? 

(3) If an annotated schema is not in normal form, do the annotations help in 
finding the normalization? 

If the answer to these questions is affirmative, we will have more evidence that the 
geometric nature of our schema is useful for database design and management. 

We hope to address these questions in the near future. 

6.5.3. Database integration. We believe that having a rigorous definition for
morphisms of databases (see Definition 4.3.7) will be of use in the problem of database
integration. The morphisms of databases can account for simultaneous changes in
schema and in data. It is also easy to allow changes in data types, a topic
we will address in later work.

Also, as mentioned in Remark 4.3.4 and Section 5.3, the use of internal keys
should prove immensely valuable. Instead of including an arbitrarily chosen
identifier for an object as part of the data for that object, as required in the theory of
relational databases, our theory keeps these arbitrary identifiers separate. When
attempting to integrate databases, it is imperative that one know which sections of
the data are observed and invariant properties of the objects being classified, and
which sections of the data are arbitrarily assigned for management reasons. Our
theory keeps these sections of the data distinct, by use of a sheaf of keys $\mathcal{K}$ that is
not considered part of the data.

In future research, we hope to show that database integration is made substan- 
tially easier when one works with a rigorous and geometric model like the one we 
present here. Before we do so, we need to explain how to work with a change in 
type specifications, which is not hard, and how to deal with constraints in the data. 
See Section 6.5.5 for our plans in this direction.

6.5.4. Ontologies and networks. One intuitively knows that there is a connection 
between databases and ontologies. An ontology is meant for organizing knowledge, 
a database is meant for organizing information, and there is a strong correlation 
between the two. In order to make this correlation precise, one must first find 
precise definitions of ontologies and databases. Further, these definitions should be 
phrased in the same language so that they can be compared. Category theory was 
invented for the purposes of comparing different mathematical structures, and as 
such provides a good setting for this project. 

Our plan (see [Spi09]) for a categorical definition of communication networks
involves annotating the simplices of a simplicial set with databases. That is, each
node in a network has access to a database of "what it knows," and connections
between nodes allow communication via a common language and set of shared
knowledge. In order to make this precise, we need a precise definition for a category
of databases, for which Definition 4.3.7 suffices.

6.5.5. More exotic types. Throughout this paper, we have fixed a type specification 
$\pi: U \to \mathbf{DT}$, where $\mathbf{DT}$ is a set of data types, and $U$ is the disjoint union of
the corresponding domains. This allows for types like strings, characters, dates, 
integers, etc. It also allows for more general types like "functions from A to B" or 
"probability distributions on a space." 

However, as flexible as our type specifications may be, the situation can be
generalized considerably by allowing $\pi$ to be a functor between categories, rather than
a function between sets. The simplest application is one that is already implicitly 
used, namely sorting data. The set of strings is in fact an ordered set, and so can be 
represented as a category (with a morphism from A to B if B is lexicographically 
larger than A). Another application comes from putting constraints in the data, 
like if we only allow (city, state) pairs for which the city is within the state. 

By generalizing type specifications to include categories rather than sets, we open 
up many new possibilities for making sense of data. Causal relationships can be 
represented, as can processes. In short, morphisms make the theory more dynamic. 